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Abstract 


This paper addresses the problem of detecting and characterizing lo- 
cal variability in time series and other forms of sequential data. The 
goal is to identify and characterize statistically significant variations, at 
the same time suppressing the inevitable corrupting observational errors. 
We present a simple nonparametric modeling technique and an algo- 
rithm implementing it — an improved and generalized version of Bayesian 
Blocks [Scargle 1998] — that finds the optimal segmentation of the data 
in the observation interval. The structure of the algorithm allows it 
to be used in either a real-time trigger mode, or a retrospective mode. 
Maximum likelihood or marginal posterior functions to measure model 
fitness are presented for events, binned counts, and measurements at ar- 
bitrary times with known error distributions. Problems addressed include 
those connected with data gaps, variable exposure, extension to piece- 
wise linear and piecewise exponential representations, multi-variate time 
series data, analysis of variance, data on the circle, other data modes, 
and dispersed data. Simulations provide evidence that the detection effi- 
ciency for weak signals is close to a theoretical asymptotic limit derived 
by |Arias-Castro, Donoho and Huo 2003] . In the spirit of Reproducible 


Research Donoho et al. (2008) all of the code and data necessary to re- 
produce all of the figures in this paper are included as auxiliary material. 


Keywords: time series, signal detection, triggers, transients, Bayesian 
analysis 
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“The line is similar to a length of time, and as 
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so the instants are the endpoints of any given ex- 
tension of time.” Leonardo da Vinci, Codex Arundel, 
folio 190v., c. 1500 A.D. [Capra 2007 . 
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1 The Data Analysis Setting 


This paper describes a method for nonparametric analysis of time 
series data to detect and characterize structure localized in time. 
Nonparametric methods seek generic representations, in contrast to 
fitting of models to the data. By local structure we mean light-curve 
features occupying sub-ranges of the total observation interval, in 
contrast to global signals present all or most of the time (e.g. pe- 
riodicities) for which Fourier, wavelet, or other transform methods 
are more appropriate. The goal is to separate statistically signif- 
icant features from the ever-present random observational errors. 
Although phrased in the time-domain the discussion throughout is 
applicable to measurements sequential in wavelength, spatial quan- 
tities, or any other other independent variable. 

This setting leads to the following desiderata: The ideal algo- 
rithm would impose as few preconditions as possible, avoiding as- 
sumptions about smoothness or shape of the signal that place a pri- 
ori limitations on scales and resolution. The algorithm should han- 
dle arbitrary sampling (i.e., not be limited to gapless, evenly spaced 
data) and large dynamic ranges in amplitude, time scale and signal- 
to-noise. For scientific data mining applications and for objectivity, 
the method should be largely automatic. To the extent possible it 
should suppress observational errors while preserving whatever valid 
information lies in the data. It should be applicable to multivari- 
ate problems. It should incorporate variation of the exposure or 
instrumental efficiency during the measurement, as well auxiliary, 
extrinsic information, e.g. spectral or color information. It should 
be able to operate both retrospectively (analyze all the data after 
they are collected) and in a real-time fashion that triggers on the 
first significant variation of the signal from its background level. 

The algorithm described here achieves considerable success in 
each of these desired features. In a simple and easy-to-use com- 
putational framework it represents the structure in the signal in a 
form handy for further analysis and the estimation of physically 
meaningful quantities. It includes an automatic penalty for model 
complexity, thus solving the vexing problems associated with model 
comparison in general and determining the order of the model in par- 
ticular. It is exact, not a greed^ approximation as in |Scargle 1998 . 

1 This term refers to algorithms that greedily make optimal improvements at each iteration 
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Versions of this algorithm have been used in various applications, 


such as [Qin et al. 2012, Norris Gehrels and Scargle 2010, Norris Gehrels and Scargle 2011 


Way Gazis and Scarg le 2011 

The following sections discuss, in turn, the basis of segmenta- 
tion analysis ( §l.ip . the piecewise constant model adopted in this 
work fi ll. 211 . extensions to piecewise linear and piecewise exponen- 
tial models the types of data that the algorithm can accept 

(§ §1.51 and 11 .6(1 . data gaps ( Nl.7[) . exposure variations ( §1.8p . a pa- 
rameter from the prior on the number of blocks ( §1.9p . generalities 
of optimal segmentation of data into blocks (£J2]), some error analysis 
( §2.8p . a variety of block fitness functions (1J3]), and sample applica- 
tions (§SJ). Appendices present some MatLab © code, some miscel- 
laneous results, and details of other data modes, including dispersed 
data F JC.8D . Ancillary hies are available providing scripts and data 
in order to reproduce all of the figures in this paper. 


1.1 Optimal Segmentation Analysis 

The above considerations point toward the most generic possible 
nonparametric data model, and have motivated the developmenVof 


data segmentation and change-point methods - see e.g 
Scargle 1998j. These methods represent the signal structure in terms 
of a segmentation of the time interval into blocks, with each block 
containing consecutive data elements satisfying some well defined 
criterion. The optimal segmentation is that which maximizes some 
quantitative expression of the criterion - for example the sum over 
blocks of a goodness-of-ht of a simple model of the data lying in 
each block. 

These concepts and methods can be applied in surprisingly gen- 
eral, higher dimensional contexts. Here, however, we concentrate 
on one-dimensional data ordered sequentially with respect to time 
or some other independent variable. In this setting segmentation 
analysis is often called change-point detection, since it implements 
models in which a signal’s statistical properties change discontinu- 
ously at discrete times but are constant in the segments between 
these change-points (see ? 12.5I) . 


O Ruanaidh and Fitzgerald 1996 


but are not guaranteed to converge to a globally optimal solution. 
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1.2 The Piecewise Constant Model 

It is remarkable that all of the desiderata outlined in the previ- 
ous section can be achieved in large degree by optimal fitting of a 
piecewise-constant model to the data. The range of the independent 
variable ( e.g . time) is divided into subintervals (here called blocks ) 
generally unequal in size, in which the dependent variable (e.g. in- 
tensity) is modeled as constant within errors. Of all possible such 
“step-functions” this approach yields the best one by maximizing 
some goodness-of-fit measure. 

Defining the times ending one block and starting the next as 
change-points , the model of the whole observation interval contains 
these parameters: 

(1) N cp : the number of change-points 

(2) C the change-point starting block k 

(3) X k : the signal amplitude in block k 

for k = 1,2,... N cp + 10 The key idea is that the blocks can be 
treated independently, in the sense that a block’s fitness depends 
on its data only. Our simple model for each block has effectively 
two parameters. The first represents the signal amplitude, and is 
treated as a nuisance parameter to be determined after the change- 
points have been located. The second parameter is the length of the 
interval spanned by the block. (The actual start and stop times of 
this interval are needed for piecing blocks together to form the final 
signal representation, but not for the fitness calculation.) 

How many blocks? A key issue is how to determine the number 
of blocks, Nuocks = N cp + 1. Nonparametric analysis invariably 
involves controlling in one way or another the complexity of the 
estimated representation. Typically such regulation is considered 
a trade-off of bias and variance, often implemented by adjusting a 
smoothing parameter. 

But smoothing is one of the very things we are trying to avoid. 
The discontinuities at the block edges are regarded as assets, not 

2 There is one more block than there are change- points: The first datum is always consid- 
ered a change-point, marking the start of the first block, and is therefore not a free parameter. 
If the last datum is a change-point, it denotes a block consisting solely of that datum. 
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liabilities to be smoothed over. So rather than smooth we influ- 
ence the number of blocks by defining a prior distribution for the 
number of blocks. Adjusting a parameter controlling the steepness 
of this prior establishes relative probabilities of smaller or larger 
numbers of blocks. In the usual fashion for Bayesian model se- 
lection in cases with high signal-to-noise Nbiocks is determined by 
the structure of the signal; with lower signal-to-noise the prior be- 
comes more and more important. In short, we are regulating not 
smoothness but complexity, much in the way that wavelet denois- 
ing [jDonoho and Johnstone 1998j operates without smoothing over 
sharp features as long as they are supported by the data. The 
adopted prior and the determination of its parameter are discussed 
in JCl below. 

This segmented representation is in the spirit of nonparametric 
approximation and not meant to imply that we believe the signal 
is actually discontinuous. The sometimes crude and blocky appear- 
ance of this model may be awkward in visualization contexts, but for 
deriving physically meaningful quantities it is not. Blocky models 
are broadly useful in signal processing [Donoho 1994] and have sev- 
eral motivations. Their simplicity allows exact treatment of various 
quantities, such as the likelihood. We can optimize or marginalize 
the rate parameters exactly, giving simple formulas for the fitness 
function (see g2]and Appendix Cd JUli . And in many applications the 
estimated model itself is less important than quantities derived from 
it. For example, while smoothed plots of pulses within gamma-ray 
bursts make pretty pictures, one is really interested in pulse loca- 
tions, lags, amplitudes, widths, rise and decay times, etc. All these 
quantities can be determined directly from the locations, heights 
and widths of the blocks - accurately and free of any smoothness 
assumptions. 


1.3 Piecewise Linear and Exponential Models 


Some researchers have applied segmentation methods with other 
block representations. For example piecewise linear models have 
been used in measuring similarity among time series and in pat- 
tern matching |Lin, Keogh, Lonardi and Chiu 2003 and to repre- 
sent time series generated by non-linear processes Tong 1990 . While 


such models may seem better than discontinuous step functions, 
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their improved flexibility is somewhat offset by added complexity of 
the model and its interpretation. Note further that if continuity is 
imposed at the change-points, a piecewise linear model has essen- 
tially the same number of degrees of freedom as does the simpler 
piecewise constant model. 

We mention two such generalizations, one modeling the signal as 
linear in time across the block: 

x(t) = A(1 + a(t - tfid)) (1) 

and the second as exponential: 

x(t) = Ae a(t -*fid) . (2) 

In both cases A is the signal strength at the fiducial time tgg and 
the coefficient a determines the rate of variation over the block. 
Such models may be useful in spite of the caveats mentioned above 
and the added complexity of the block fitness functions. Hence we 
provide some details in Appendix C, ^C.9l and lC.10l 

1.4 Histograms 

For event data the piecewise constant representation can be inter- 
preted as a histogram of the measured values — one in which the 
bins are not fixed ahead of time and are free to be unequal in size 
as determined by the data. In this context the time order of the 
measurements is irrelevant. Once one determines the parameter in 
the prior on the number of bins, ncp_prior, one has an objective 
histogram procedure in which the number, individual sizes, and lo- 
cations of the bins are determined solely and uniquely by the data. 


1.5 Data Modes 


The algorithms developed here can be used with a variety of types 
of data, often called data modes in instrumentation contexts. An 
earlier paper |Scargle 1998 [ described several, with formulas for the 
corresponding fitness functions. Here we discuss data modes in a 
broader perspective. It is required that the measurements provide 
sufficient information to determine which block they belong to and 
then to compute the model fitness function for the block (cf. 
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Almost any physical variable and any measurement scheme for it, 
discrete or quasi- continuous, can be accommodated. In the simple 
one dimensional case treated here, the independent variable is time, 
wavelength, or some other quantity. The data space is the domain 
of this variable over which measurements were made - typically an 
interval, possibly interrupted by gaps during which the measuring 
system was not operating. 

The measured quantity can be almost anything that yields infor- 
mation about the target signal. The three most common examples 
emphasized here are: (a) times of events (often called time-tagged 
event data, or TTE), (b) counts of events in time bins, and (c) mea- 
surements of a quasi-continuous observable at a sequence of points 
in time. For the first two cases the signal of interest is the event 
rate , proportional to the probability distribution regulating events 
which occur at discrete times due to the nature of the astrophys- 
ical process and/or the way it is recorded. We call case (c) point 
measurements, not to be confused with point data (also called event 
data). These modes have much in common, as they all comprise 
measurements that can be at any time; what differentiates them is 
their statistics, roughly speaking Bernoulli, Poisson, and Gaussian 
(or perhaps some other) respectively. 

The archetypal example of (a) is light collected by a telescope and 
recorded as a set of detection times of individual photons to study 
source variability. Case (b) is similar, but with the events collected 
into bins - which do not have to be equal or evenly spaced. Case 
(c) is common when photons are not detected individually, such as 
in radio flux measurements. In all cases it is useful to represent the 
measurements with data cells, typically one for each measurement 
(see 1 12. 2p . In principle mixtures of cells from different data types 
can be handled, as described in the next section. 

1.6 Mixed Data Modes 

Our algorithm can analyze mixtures of any data types within a single 
time series. For example the data stream could consist of arbitrary 
combinations of cells of the three types defined above - measured 
values, counts in bins and events - with or without overlap in time 
among the various data modes. In regions of overlap the block 
representation would be based on the combined data; otherwise it 
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would represent block structures supported by the corresponding 
individual data modes. In such applications the cost function must 
refer to a common signal amplitude parameter, possibly including 
normalization factors to account for differences in the measurement 
processes. 

A related concept is that of multivariate time series , usually re- 
ferring to concurrent observations from different telescopes. The 
distinction between this concept and mixed data modes is largely 
semantic. Hence we leave implimcntation details to 1 14.21 

1.7 Gaps 

In some cases there are subintervals over which no data can be 
obtained. For example there may be random interruptions such as 
detector malfunction at unpredictable but known periods of time, 
or regular interruptions as the Earth periodically blocks the view 
of an object from an orbiting space observatory. (Of course this 
case is very different from intervals in which no events happened to 
be detected, due to low event rate, or in which one simply did not 
happen to make point measurements. 

Such data gaps have a nearly invisible affect on the algorithm, 
fundamentally due to the fact that it operates locally in the time 
domain. For event data all that matters is the live time during the 
block, i.e. the time over which data could have been registered. 
Other than correcting the total time span of any putative block 
containing data gaps by subtracting the corresponding dead time, 
gaps can be handled by ignoring them. Operationally one simply 
treats the data right after a gap as immediately following the data 
right before it (and not delayed by the length of the gap). Think of 
this as squeezing the interval to eliminate the gaps, carrying out the 
analysis as if no gaps are present, and then un-doing the squeezing 
by restoring the original times. This procedure is valid because event 
independence means that the fitness of a block depends on only its 
total live-time and the events within it. 

For event data this squeezing can be implemented by subtracting 
from each event time the sum of the lengths of all the preceding gaps. 
One small detail concerns the points just before and just after a gap. 
One might think their time intervals should be computed relative to 
the gap edges. But it follows from the nature of independent events 
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(Appendix B, §|Bj that they can be computed as though the gap did 
not exist. 

The only other subtlety lies in interpreting the model in and 
around gaps. There are two possibilities: a given gap (a) may lie 
completely within a block or (b) it may separate two blocks. Case 
(a) can be taken as evidence that the event rates before and after the 
gap are deemed the same within statistical fluctuations. Case (b) on 
the other hand implies that the event rate did change significantly. 

Of course the gaps must be restored for display and other post- 
processing analysis. Think of this as un-squeezing the data so that 
all blocks appear at their correct locations in time. Keeping in mind 
that there is no direct information about what happened during 
unobserved intervals, plots should probably include some indication 
that rates within gaps are unknown or uncertain, such as by use of 
dotted lines or shading in the gap for case (a) or leaving the gap 
interval completely blank in case (b). 

For the case of point measurements the situation is different. In 
one sense there are no gaps at all, and in another sense the entire 
observation interval consists of many gaps separating tiny intervals 
over which the measurements were actually made. One is hard- 
pressed to make a statistical distinction between various reasons 
why there is not a measurement at a given time - e.g. detector 
and weather problems, or simply a choice as to how to allocate 
observing time (a choice that may even depend on the results of 
analyzing previous data). Basically the blocks in this case represent 
intervals where whatever measurements were made in the interval 
are consistent with a signal that is constant over that interval. 

Note that things would be different if one wanted to define a 
fitness function dependent on the total length of the block, not just 
its live time. This would arise for example if a prior on the block 
length were imposed. Such possibilities will not be discussed here. 

1.8 Exposure Variations 

In some applications the effective instrument response is not con- 
stant. The measurements then reflect true source variations modi- 
fied by changes in overall throughput of the detection system. We 
use the term exposure for any such effect on the detected signal - 
e.g. detector efficiency, telescope effective area, beam pattern and 
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point spread function effects. Exposure can be quantified by the 
ratio of the expected signal with and without any such effects. It 
may be calculable from properties of the observing system, deter- 
mined after the fact through some sort of calibration procedure, or 
a combination of the two. Here we assume that this ratio is known 
and expressed as a number e n , typically with 0 < e n < 1, for data 
cell 7i. 

The adjustment for exposure is simple, namely change the pa- 
rameter representing the observed signal amplitude in the likelihood 
to what it would have been if the exposure had been unity. First 
compute the exposure e n for data cell n. Then increase by the fac- 
tor l/e n whatever quantity in the data cell represents the measured 
signal intensity. Specifically, for time-tagged event data this pa- 
rameter is the reciprocal of the interval of the corresponding data 
cell: 1/A t n (see eq. (l^UD ). which is then replaced with l/(e n A t n ). 
For bin counts the bin size can be multiplied by e n or equivalently 
the count by l/e n . For point measurements multiply the amplitude 
measurement by l/e n (and adjust any observational error parame- 
ters accordingly). In all cases the goal is to represent the data as 
closely as possible to what it would have been if the exposure had 
been constant. Of course this restoration is not exact in individual 
cases, but is correct on average. 

For TTE data the fact that interval A t n as we define it in eq. 
(12UD depends on the times of two different events (just previous to 
and just after the one under consideration) may seem to pose a 
problem. The exposures of these events will in general be different, 
so what value do we use for the given event? The comforting answer 
is that the only relevant exposure is that for the given event itself. 
For consider the interval from the previous to the current time, 
namely t n — t n _i. Here t n ^\ is regarded as simply a fiducial time 
and the distribution of this interval is given by eq. (H7|) with A 
the true rate adjusted by the exposure for event n, by the principle 
described in < IB.5I just after this equation. Similarly by a time- 
reversal invariance argument the distribution of the interval to the 
subsequent event, namely t n+ \ — t n , also depends on only the same 
quantity. In summary event independence (Appendix C, jJC]) yields 
the somewhat counterintuitive fact that the probability distribution 
of A t n = {t n+ i — t n _i)/2 of the interval surrounding event n depends 
on only the effective event rate for event n. 
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1.9 Prior for the Number of Blocks 


Earlier work [Scargle 1998] did not assign an explicit prior probabil- 
ity distribution for the number of blocks, i.e. the parameter N b i oc ks- 
This omission amounts to using a flat prior, but in many contexts 
it is unreasonable to assign the same prior probability to all values. 
In particular, in most settings it is much more likely a priori that 
N blocks « N than that Nbiocks ~ N. For this reason it is desirable 
to impose a prior that assigns smaller probability to a large number 
of blocks, and we adopt this geometric prior jCoram 2002] : 


for 0 < Nbi oc ks < N, and zero otherwise since N b i oc ks cannot be 
negative or larger than the number of data cells. The normalization 
constant P 0 is easily obtained, giving 


Through this prior the parameter 7 influences the number of blocks 
in the optimal representation - a number of some importance since 
it affects the visual appearance of the representation and to a lesser 
extent the values of quantities derived from it. This form for the 
distribution dictates that Ending k + 1 blocks is less likely by the 
constant factor 7 than is Ending k blocks. In almost all applications 
7 will be chosen < 1 to express that a smaller number of blocks is a 
priori more likely than a larger number. 

In principle the choice of a prior and the values of its param- 
eters expresses one’s prior knowledge in a specific problem. The 
convenient geometric prior adopted here has proven to be generic 
and flexible, and its parameterization is simple and straightforward. 
These properties are appropriate for a generic analysis tool meant 
for a wide variety of applications. One can think of selecting 7 as a 
simple way of adjusting the amount of structure in the block repre- 
sentation of the signal. It is specifically not a smoothing parameter 
but is analogous to one. 

The expected number of blocks follows from eq. ([3]) 


P(N blocks ) = P Q ^ocks 


( 3 ) 


P (N b i ocks ) 



( 4 ) 


N 



< N blocks > = P 0 N blocksl Nbl ° cks 


N blocks — 0 
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Note that the actual number of blocks is a discontinuous, monotonic 
function of 7, and because its jumps can be > 1 it is not generally 
possible to force a prescribed number by adjusting this parameter. 

The above prior is not the only one possible and different forms 
may be useful in some applications. But Eq. (J3J) is very convenient 
to implement, since with the fitness equal to the log of the posterior, 
one only needs to subtract the constant —log 7 (called ncp_prior 
in the MatLab code and in the discussion of computational issues 
below) from the fitness of each block. Determining the value to use 
in applications is discussed in < 12.71 below. 

2 Optimum Segmentation of Data on an Inter- 
val 

Piecewise constant modeling of sequential measurements on a time 
interval T is most conveniently implemented by seeking an optimal 
partition of the ordered set of data cells within T. In this special 
case of segmentation the segments cover the whole set with no over- 
lap between them (Appendix B). Segmentations with overlap are 
possible, for example in the case of correlated measurements, but 
are not considered here. One can envision our quest for the optimal 
segmentation as nothing more than finding the best step-function, 
or piecewise constant model, fit to the data — defined by maximizing 
a specific fitness measure as detailed in 1 12.41 

We introduce our algorithm in a somewhat abstract setting be- 
cause the formalism developed here applies to other data analy- 
sis problems beyond time series analysis. It implements Bayesian 
Blocks or other ID segmentation ideas for any model fitness function 
that satisfies a simple additivity condition. It improves the previous 
approximate segmentation algorithm |Scargle 1998) by achieving an 
exact, rigorous solution of the multiple change-point problem, guar- 
anteed to be a global optimum, not just a local one. 

The rest of this section describes how the model is structured for 
effective solution of this problem, while details on the quantity to 
be optimized are deferred to the next section. 

2.1 Partitions 

Partitions of a time interval T are simply collections of non-overlapping 
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blocks (defined below in 1 12.31) . defined by specifying the number of 
its blocks and the block edges: 

J’(-f) = {N, blocks ; Tiki k — 1, 2, 3, . . . Nbi oc k s } . (6) 

where the are indices of the data cells f f !2 .2jl defining times called 
change-points (see 1 12. 5p . 

Appendix B gives a few mathematical details about partitions, 
including justification of the restriction of the change-points to coin- 
cide with data points and the result that the number of possible par- 
titions (i.e. the number of ways N cells can be arranged in blocks) 
is 2 n . This number is exponentially large, rendering an explicit 
exhaustive search of partition space utterly impossible for all but 
very small N. Our algorithm implicitly performs a complete 
search of this space in time of order N 2 , and is practical even 
for N ~ 1, 000, 000, for which approximately 10 300 > 000 partitions are 
possible. The beauty of the algorithm is that it finds the optimum 
among all partitions without an exhaustive explicit search, which is 
obviously impossible for almost any value of N arising in practice. 

2.2 Data Cells 

For input to the algorithm the measurements are represented with 
data cells. For the most part there is one cell for each measure- 
ment, although in the case of TTE data two or more events with 
identical time-tags may be combined into a single cell. A conve- 
nient data structure is an array containing the cells ordered by the 
measurement times. 

Specification of the contents of the cells must meet two require- 
ments. First they must include time information allowing determi- 
nation of which cells lie in a block given its start and stop times. 
Post-processing steps such as plotting the blocks may in addition 
use the actual times, either absolute or relative to a specified origin. 

The other requirement is that the fitness of a block can be com- 
puted from the contents of all the cells in it 1 1 12. 41 §3]). For the three 
standard cases the relevant data are roughly speaking: (a) intervals 
between events f fi3.1|h (b) bin sizes, locations and counts f d3. 2j) . and 
(c) measured values augmented by a quantifier of measurement un- 
certainty f £13.3p . These same quantities enable construction of the 
resulting step function for post-processing steps such as computing 


15 


signal parameters. 


2.3 Blocks of Cells 

A block is any set of consecutive cells, either an element of the op- 
timal representation or a candidate for it. Each block represents a 
subinterval (within the full range of observation times) over which 
the amplitude of the signal can be estimated from the contents of 
its cells ( §2.2h . A block can be as small as one cell or as large as all 
of the cells. 

Our time series model consists of a set of blocks partitioning the 
observations. All model parameters are constant within each block 
but undergo discrete jumps at the change-points i H2.5[) marking the 
edges of the blocks. The model is visualized by plotting rectangles 
spanning the intervals covered by the blocks, each with height equal 
to the signal intensity averaged over the interval. The concept of 
fitness of a block is fundamental to everything else in this paper. As 
we will see in the next section the fitness of a partition is the sum 
of the fitnesses of the blocks comprising it. 

2.4 Fitness of Blocks and Partitions 

Since the goal is to represent the data as well as possible within a 
given class of models, we maximize a quantity measuring the fitness 
of models in the given class, here the class of all piecewise constant 
models. Alternatively, one can minimize an error measure. Both op- 
erations are called optimization. The algorithm relics on the fitness 
being block-additive, i.e. 

N blocks 

iWT)]= £ /(£>*) , (7) 

k = 1 

where F[‘J > (T)\ is the total fitness of the partition fP of interval T, 
and f(Bk) is the fitness of block k. The latter can be any convenient 
measure of how well a constant signal represents the data within 
the block. Typically additivity results from independence of the 
observational errors. We here ignore the possibility of correlated 
errors, which could make the fitness of one block depend on that of 
its neighbors. Remember correlation of observational errors is quite 
separate from correlations in the signal itself. 
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All model parameters are marginalized except the rik specifying 
block edges. Then the total fitness depends on only these remaining 
parameters - i.e. on the detailed specification of the partition by 
indicating which cells fall in each of its blocks. The best model is 
found by maximizing F over all possible such partitions. 

2.5 Change-points 

In the time series literature a point at which a statistical model 
undergoes an abrupt transition, by one or more of its parameters 
jumping instantaneously to a new value, is called a change-point. 
This is exactly what happens at the edges of the blocks in our model. 
In principle change-points can be at arbitrary times. However, fol- 
lowing the data cell representation and without any essential loss 
of generality they can be restricted to coincide with a data point 
(Appendix B; gHD . 

A few comments on notation are in order. We take blocks to start 
at the data cell identified by the algorithm as a change-point and 
to end at the cell previous to the subsequent change-point. A slight 
variation of this convention is discussed below in 1 14.41 in connection 
with allowing the possibility of empty blocks in the context of event 
data. One might adopt other conventions, such as apportioning the 
change-point data cell to both blocks, but we do not do so here. 
Even though the first data cell in the time series always starts the 
first block, our convention is that it is not considered a change-point. 
In the code presented here the first change-point marks the start of 
the second block. For k > 1 the fc-th block starts at index n^-i and 
ends at nfc — 1. The first block always starts with the very first data 
cell. The last block always terminates with the very last data cell. 
If the last cell is a change-point, it defines a block consisting of only 
that one cell. The set of change-points is empty if the best model 
consists of a single block, meaning that the time series is sensibly 
constant over the whole observation interval. The number of blocks 
is one greater than the number of change-points. 


17 


2.6 The Algorithm 


We now outline the basic algorithm yielding the desired optimum 
partitions. The details of this dynamic programming approach 
[Bellman 19611 |Hubert, Arabie, and Meulman 2001[ |Dreyfus 2002| 
are in [Jackson et al. 2005] . It follows the spirit of mathematical 
induction: beginning with the first data cell, at each step one more 
cell is added. The analysis makes use of results stored from all 
previous steps. Remarkably the algorithm is exact and yields the 
optimal partition of an exponentially large space in time of order 
N 2 . The iterations normally continue until the whole interval has 
been analyzed. However its recursive nature allows the algorithm 
to function in a trigger mode , halting when the first change-point is 
detected ( §4.3h . 

Let iP op# (i?) denote the optimal partition of the first R cells. In 
the starting case R — 1 the only possible partition (one block con- 
sisting of the first cell by itself) is trivially optimal. Now assume we 
have completed step R, identifying the optimal partition iP° p 4 (i?). 
At this (and each previous) step store the value of optimal fitness 
itself in array best and the location of the last change-point of the 
optimal partition in array last. 

It remains to show how to obtain CP opf (i?-l-l). For some r consider 
the set of all partitions (of these first R + 1 cells) whose last block 
starts with cell r (and by definition ends at R+ 1). Denote the fitness 
of this last block by F(r). By the subpartition result in Appendix 
B the only member of this set that could possibly be optimal is 
that consisting of < S >opt {r — 1) followed by this last block. By the 
additivity in Eq. ([7J) the fitness of said partition is the sum of F(r) 


3 Bellman’s explanation (before the word “programming” took on its current computational 
connotation) of how he chose this name is interesting. The Secretary of Defense at the time 
"... had a pathological fear and hatred of the word, research. ... You can imagine how he 
felt, then, about the term, mathematical. ... I felt I had to do something to shield ... the 
Air Force from the fact that I was really doing mathematics inside the RAND Corporation. 
... I was interested in planning ... But planning is not a good word for various reasons. I 
decided therefore to use the word, programming. I wanted to get across the idea that this was 
dynamic ... it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of 
some combination that will possibly give it a pejorative meaning. It’s impossible. Thus, I 
thought dynamic programming was a good name. It was something not even a Congressman 
could object to. 
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( 8 ) 


and the fitness of (P° pt (r — 1) saved from a previous step: 

A{r) = F(r) + { best(r - 1), r = 2, 3, . . . , R + 1 . 

.4(1) is the special case where the last block comprises the entire 
data array and thus no previous fitness value is needed. Over the 
indicated range of r this equation expresses the fitness of all par- 
titions (P(R+1) that can possibly be optimal. Hence the value of 
r yielding the optimal partition + 1) is the easily computed 

value maximizing A(r): 

r° pt = argmax[4(r)] . (9) 

At the end of this computation, when R = N, it only remains to 
find the locations of the change-points of the optimal partition. The 
needed information is contained in the array last in which we have 
stored the index r opt at each step. Using the corollary in Appendix B 
it is a simple matter to use the last value in this array to determine 
the last change-point in P opt (N ), peel off the end section of last 
corresponding to this last block, and repeat. That is to say, the set 
of values 

cpi = last(N); cp 2 = last{cp\ — l ); cp 3 = last(cp 2 — 1); . . . (10) 

are the index values giving the locations of the change-points, in 
reverse order. Note that the positions of the change-points are not 
necessarily fixed until the very last iteration, although in practice 
it turns out that they become more or less “frozen” once a few 
succeeding change-points have been detected. MatLat@ code for 
optimal partitioning of event data is given in Appendix A. 

2.7 Fixing the Parameter in the Prior Distribution for 

N blocks 

As mentioned in §1.91 the output of the algorithm is dependent on 
value of the parameter 7, characterizing the assumed prior distri- 
bution for the number of blocks, eq. (l3lh In many applications the 
results are rather insensitive to the value as long as the signal-to- 
noise ratio is even moderately large. Nevertheless extreme values 

4 TMrp^g Mathworks, Inc 
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of this parameter give bad results in the form of clearly too few 
or too many blocks. In any case one must select a value to use in 
applications. 

This situation is much like that of selecting a smoothing param- 
eter in various data analysis applications, e.g. density estimation. 
In such contexts there is no perfect choice but instead a tradeoff be- 
tween bias and variance. Here the tradeoff is between a conservative 
choice not fooled by noise fluctuations but potentially missing real 
changes, and a liberal choice better capturing changes but yield- 
ing some false detections. Several approaches have proven useful in 
elucidating this tradeoff. Merely running the algorithm with a few 
different values can indicate a range over which the block represen- 
tation is reasonable and not very sensitive to the parameter value 
(cf. Fig. Up. 

The discussion of fitness functions below in Ogives implementa- 
tion details of an objective method for calibrating ncp_prior as a 
function of the number of data points. It is based on relating this 
parameter to the false positive probability - that is, the relative 
frequency with which the algorithm falsely reports detection of a 
change-point in data with no signal present. It is convenient to use 
the complementary quantity 

Po = 1 — false positive probability . (11) 

This number is the frequency with which the algorithm correctly 
rejects the presence of a change-point in pure noise. Therefore it 
is also the probability that a change-point reported by the algo- 
rithm with this value of ncp_prior is indeed statistically significant 
- hence we call it the correct detection rate for single change-points. 

The needed ncp_prior-po relationship is easily found by noting 
that the rates of correct and incorrect responses to fluctuations in 
simulated pure noise can be controlled by adjusting the value of 
ncp_prior. The procedure is: generate a synthetic pure-noise time 
series; apply the algorithm for a range of ncp_prior; select the 
smallest value that yields false detection frequency equal or less than 
the desired rate, such as .05. The values of ncp_prior determined 
in this way are averaged over a large number of realizations of the 
random data. The result depends on only the number of data points 
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and the adopted value of po : 

ncpjprior = ip(N,p 0 ) . (12) 

Results from simulations of this kind are given below for the various 
fitness functions in § §3.11 13.21 and 13.31 We have no exact formulas, 
but rather fits to these numerical simulations. 

The above discussion is useful in the simple problem of decid- 
ing whether or not a signal is present above a background of noisy 
observations. In other words we have a procedure for assigning a 
value of ncp_prior that results in an acceptable frequency of spu- 
rious change-points, or false positives, when searching for a single 
statistically significant change. Real-time triggering on transients 
( §4.31) is an example of this situation, as is any case where detection 
of a single change-point is the only issue in play. 

But elucidating the shape of an actual detected signal lies outside 
the scope of the above procedure, since it is based on a pure noise 
model. A more general goal is to limit the number of both false 
negatives and false positives in the context of an extended signal. 
The choice of the parameter value here depends on the nature of 
the signal present and the signal-to-noise level. One expects that 
somewhat larger values of ncp_prior are necessary to guard against 
corruption of the estimate of the signal’s shape due to errors at 
multiple change-points. 

This idea suggests a simple extension of the above procedure. 
Assume that a value of p 0 , the probability of correct detection of 
an individual change-point, has been adopted and the correspond- 
ing value of ncp_prior determined with pure noise simulations as 
outlined above and expressed in eq. (1T2|) . For a complex signal our 
goal is correct detection of not just one, but several change-points, 
say N cp in number. The trick is to treat each of them as an inde- 
pendent detection of a single change-point with success rate pq. The 
probability of all N cp successes follows from the law of compound 
probabilities: 

p(N cp )=p f cp . (13) 

There are problems with this analysis in that the following are not 
true: 

(1) Change-point detection in pure noise and in a signal are the 
same. 
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(2) The detections are independent of each other. 


(3) We know the value of N cp . 

All of these statements would have to be true for eq. (1131) to be 
rigorously valid. We propose to regard the first two as approximately 
true and address the third as follows: Decide that the probability of 
correctly detecting all the change-points should be at least a high as 
some value p*, such as 0.95. Apply the algorithm using the value of 
ncp_prior = ^(iV,p*) given by the pure noise simulation. Use eq. 
(USD and the number of change-points thus found to yield a revised 
value 

ncpjprior = ip(N ,p^ Ncp ) ■ (14) 


Stopping when the iteration produces no further modification of the 
set of change-points, one has the recommended value of ncp_prior. 
This ad hoc procedure is not rigorous, but it establishes a kind of 
consistency and has proven useful in all the cases where we have tried 


it, e.g. [Norris Gehrels and Scargle 2010, Norris Gehrels and Scargle 20F1 



Figure 1: Cross-validation error of BATSE TTE data (averaged over 532 GRBs, 
8 random subsamples, and time) for a range of values of -log 7 = with 3er error 
bars. 

Fig. |T] shows another approach, based on cross-validation of the 
data being analyzed (c/. [Hogg 2008 ) . This study uses the collec- 
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tion of raw TTE data at the BATSE web site 

f tp : / /legacy . gsf c .nasa.gov/compton/data/batse/ ascii_dat a/bat se_tte/. 

The hies for each of 532 GRBs contain time tags for all photons de- 
tected for that burst. The energy and detector tags in the data hies 
were not used here, but 1 14.11 shows an example using the former. 
An ordinary 256-bin histogram of all photon times for each of 532 
GRBs was taken as the true signal for that burst. Eight random sub- 
samples smaller by a factor of 8 were analyzed with the algorithm 
using the fitness in eq. (TT9l) . The average RMS error between these 
block representations (evaluated at the same 256 time points) and 
the histogram is roughly hat over a broad range. While this illustra- 
tion with a relatively homogeneous data set should obviously not be 
taken as universal, the general behavior seen here - determination 
of a broad range of nearly equally optimal values of ncp_prior - is 
characteristic of a wide variety of situations. 

2.8 Analysis of Variance 

Assessment of uncertainty is an important part of any data analysis 
procedure. The observational errors discussed throughout this paper 
are propagated by the algorithm to yield corresponding uncertainties 
in the block representation and its parameters. The propagation of 
stochastic variability in the astronomical source is a separate issue, 
called cosmic variance , and is not discussed here. 

Since the results here comprise a complete function defined by 
a variable number of parameters, quantification of uncertainty is 
considerably more intricate than for a single parameter. In partic- 
ular one must specify precisely which of the block representation’s 
aspects is at issue. Here we discuss three: (a) the full block rep- 
resentation, (b) the very presence of the change-points themselves, 
and (c) locations of change-points. 

A straightforward way to deal with (a) is by bootstrap analysis. 
As described in [Efron and Tibshirani 1998] for time series data this 
procedure is rather complicated in general. However resampling of 
event data in the manner appropriate to the bootstrap is trivial. The 
procedure is to run the algorithm on each of many bootstrap sam- 
ples and evaluate the resulting block representations at a common 
set of evenly spaced times. In this way models with different num- 
bers and locations of change-points can be added, yielding means 
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and variances for the estimated block light curves. The bootstrap 
variance is an indicator of light curve uncertainty. In addition com- 
parison of the bootstrap mean with the block representation from 
the actual data adds information about modeling bias. The former 
is rather like a model average in the Bayesian context. This average 
typically smoothes out the discontinuous block edges present in any 
one representation. In some applications the bootstrap mean may 
be more useful than the block representation. 

This method does not seem to be useful for studying uncertainty 
in the change-points themselves, in particular their number, presum- 
ably because the duplication of data points due to the replacement 
feature of the resampling yields excess blocks (but with random lo- 
cations and small amplitude variance, and therefore with little effect 
on the mean light curve). 

By (b) is meant an assessment of the statistical significance of the 
identification of a given change point. For a given change-point we 
suggest quantification of this uncertainty by evaluating the ratio of 
the fitness functions for the two blocks on either side of that change- 
point to that of the single block that would exist if the change-point 
were not there. The corresponding difference of the (logarithmic) 
fitness values should be adjusted by the value of the constant param- 
eter ncp_prior, for consistency with the way fitness is computed in 
the algorithm. 

Finally, (c) is easily addressed in an approximate way by fixing all 
but one change-point and computing fitness as a function of the loca- 
tion of that change-point. This assessment is approximate because 
by fixing the other change-points because it ignores inter-change- 
point dependences. One then converts the run of the fitness func- 
tion with change-point location into a normalized probability dis- 
tribution, giving comprehensive statistical information about that 
location (mean, variance, confidence interval, etc.) 

Sample results of all of these uncertainty measures in connection 
with analysis of a gamma-ray burst light curve are shown below in 
1 14.11 especially Fig. [8j 

2.9 Multivariate Time Series 

Our algorithm’s intentionally flexible data interface not only allows 
processing a wide variety of data modes but also facilitates joint 
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analysis of mode combinations. This feature allows one to obtain 
the optimal block representation of several concurrent data streams 
with arbitrary modes and sample times. This analysis is joint in 
the sense that the change-points are constrained to be at the same 
times for all the input series; in other words the block edges for all 
of the input data series line up. The representation is optimal for 
the data as a whole but not for the individual time series. 

To interpret the result of a multivariate analysis one can study 
the blocks in the different series in two ways: (a) separately, but with 
the realization that the locations of their edges are determined by 
all the data; or (b) in a combined representation. The latter requires 
that there be a meaningful way to combine amplitudes. For example 
the plot of a joint analysis of event and binned data could simply 
display the combined event rate for each block, perhaps adjusting 
for exposure differences. For other modes, such as photon events 
and radio frequency fluxes, a joint display would have to involve a 
spectral model or some sort of relative normalization. The example 
in 1 14.21 below will help clarify these issues. 

The idea extending the basic algorithm to incorporate multiple 
time series is simple. Each datum in any mode has a time-tag asso- 
ciated with it - for example the event time, the time of a bin center, 
or the time of a point measurement. The joint change-points are 
allowed to occur at any one of these times. Hence the times from 
all of the separate data streams are collected together into a single 
ordered array; the ordering means that the times - as well as the 
measurement data - from the different modes are interleaved. The 
cartoon in Fig. [2] shows how the individual concatenated times and 
data series are placed in separate blocks in a matrix (top) and then 
redistributed (bottom) by ordering the combined times. Then the 
fitness function for a given data series can be obtained from the 
corresponding data slice (e.g. the horizontal dashed line in the fig- 
ure, for Series # 2). The zero entries in these slices (indicated by 
white space in the figure) are such that the fitness function for data 
from each series is evaluated for only the appropriate data and mode 
combination. The overall fitness is then simply the sum of those for 
the several data series. The details of this procedure are described 
in the code provided in Appendix A (h JAl) . 
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Figure 2: Cartoon depicting an example of how three data series are first con- 
catenated into a matrix (top) and then redistributed by ordering the combined 
time-tags (bottom). The cost functions for the series can then be computed 
from the data in horizontal slices ( e.g . dashed line) and combined, allowing the 
change-points to be at any of the time tags. 

2.10 Comparison with Theoretical Optimal Detection Ef- 
ficiency 

How good is the algorithm at extracting weak signals in noisy data? 

This section gives evidence that it achieves detection sensitivity 

closely approaching ideal theoretical limits. The formalism in | Arias- Castro, Donoho and Huo 1 
treats detection of geometric objects in data spaces of arbitrary di- 
mension using multiscale methods. The one dimensional special 
case in §11 of this reference is essentially equivalent to our problem 
of detecting a single block in noisy time series. 

Given N measurements normalized so that the observational er- 
rors ~ 7V(0, <r) (normally distributed with zero mean and variance 
a 2 ), these authors show that the threshold for detection is 

A\ — a \J2 logN . (15) 

This result is asymptotic [i.e. valid in the limit of large N). It is 
valid for a frequentist detection strategy based on testing whether 
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the maximum of the inner product of the model with the data ex- 
ceeds the quantity in eq. (1T5|) or not. These authors state “In 
short, we can efficiently and reliably detect intervals of amplitude 
roughly \J 2 logiV , but not smaller.” More formally the result is 
that asymptotically their test is powerful for signals of amplitude 
greater than A\ and powerless for weaker signals. 



0 10 20 30 40 50 60 70 80 90 100 


Figure 3: One hundred unit variance normally distributed measurements - zero- 
mean (+) except for a block of events 25-75 (dots). In the four panels the block 
amplitudes are 0.2, .32, .5, and 1.0 in units of the Arias-Castro et al. threshold 
V 2 logN . Thick lines show the blocks, where detected, with thin vertical lines 
at the change-points. 

It is of interest to see how well our algorithm stacks up against 
these theoretical results, since the two analysis approaches (matched 
filter test statistic vs. Bayesian model selection) are fundamentally 
different. Consider a simulation consisting of normally distributed 
measurements at arbitrary times in an interval. These variates are 
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taken to be zero-mean-normal, except over an unknown sub-interval 
where the mean is a hxed constant. In this experiment the events 
are evenly spaced, but only their order matters, so the results would 
be the same for arbitrary spacing of the events. Fig. [ 3 ] shows syn- 
thetic data for four simulated realizations with different values for 
this constant. The solid line is the Bayesian blocks representation, 
using the posterior in Eq. (110211 . For the small amplitudes in the 
first two panels no change-points are found; these weak signals are 
completely missed. In the other two panels the signals are detected 
and approximately correctly represented - with small errors in the 
locations of the change-points. 




Figure 4: Error in finding a single block vs. simulated block amplitude in units 
of Arias-Castro et al.’s threshold amplitude. The curves (from right to left) are 
for N = 32, 64, 128, 256, 1024 and 2048. 
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Fig. [4] reports some results of detection of the same step-function 
process shown in Fig. [3l averaged over many different realizations 
of the observational error process and for several different values of 
N. The lines are plots of a simple error metric (combing the errors 
in the number of change-points and their locations) as a function of 
the amplitude of the test signal. The left panel is for the case where 
the number of points in the putative block is held fixed, whereas 
the right panel is this number is taken to be proportional to N, 
sometimes a more realistic situation. We have adopted the following 
definition for the threshold in this case: 

M = 8. . (16) 

This formula is consistent with adjusting the normalized width in 
|Arias-Castro, Donoho and Huo 2003 with a factor N: 8 is an arbi- 
trary factor for plotting. 

Our method yields small errors when the signal amplitude is 
on the order or even somewhat smaller than the limit stated by 
|Arias-Castro, Donoho and Huo 2003] , showing that we are indeed 
close to their theoretical limit. The main difference here is that our 
results are for specific values of N and the theoretical results are 
asymptotic in N. 

3 Block Fitness Functions 

To complete the algorithm all that remains is to define the model 
fitness function appropriate to a particular data mode. By equation 
(JTJ) it is sufficient to define a block fitness function , which can be 
any convenient measure of how well a constant signal represents the 
data in the block. Naturally this measure will depend on all data in 
the block and not on any outside it. As explained in 1 12.41 it cannot 
depend on any model parameters other than those specifying the 
locations of the block edges. In practice this means that block height 
(signal amplitude) must somehow be eliminated as a parameter. 
This can be accomplished, for example, by taking block fitness to 
be the relevant likelihood either maximized or marginalized with 
respect to this parameter. Either choice yields a quantity good 
for comparing alternative models, but not necessarily for assessing 
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goodness-of-fit of a single model. Note that these measures as such 
do not satisfy the additivity condition Eq. As long as the cell 
measurement errors are independent of each other the likelihood of 
a string of blocks is the product of the individual values, but not the 
required sum. But simply taking the logarithm yields the necessary 
additivity. 

There is considerable freedom in choosing fitness functions to 
be used for a given type of data. The examples described here have 
proven useful in various circumstances, but the reader is encouraged 
to explore other block-additive functions that might be more appro- 
priate in a given application. For all cases considered in this paper 
the fitness function depends on data in the block through summary 
parameters called sufficient statistics , capturing the statistical be- 
havior of the data. If these parameters are sums of quantities defined 
on the cells the computations are simplified; however this condition 
is not essential. 

Two types of factors in the block fitness can be ignored. A con- 
stant factor C appearing in the likelihood for each data cell yields 
an overall constant term in the derived logarithmic fitness function 
for the whole time series, namely N log C. Such a term is indepen- 
dent of all model parameters and therefore irrelevant for the model 
comparison in the optimization algorithm. In addition, while a term 
in the block fitness that has the same value for each block does affect 
total model fitness, it contributes a term proportional to the number 
of blocks, and which therefore can be absorbed into the parameter 
derived from the prior on the number of blocks (cf. fll.91) . 

Many of the data modes discussed in the following subsections 
were operative in the Burst and Transient Source Experiment (BATSE) 
experiment on the NASA Compton Gamma Ray Observatory (GRO), 
the Swift Gamma-Ray Burst Mission, the Fermi Gamma Ray Space 
Telescope, and many x-ray and other high-energy observatories. 
They are also relevant in a wide range of other applications. 

In the rest of this section we exhibit expressions that serve as 
practical and reliable fitness functions for the three most common 
data modes: event data, binned data, and point measurements with 
normal errors. Some refinements of this discussion and some other 
less common data modes are discussed in Appendix C, IjG] 
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3.1 Event Data 


For series of times of discrete events it is natural to associate one 
data cell ( §2.2h with each event. The following derivation of the 
appropriate block fitness will elucidate exactly what information 
the cells must contain to allow evaluation of the fitness for the full 
multi-block model. 

In practice the event times are integer multiples of some small 
unit (< 1C.1|) but it is often convenient to treat them as real numbers 
on a continuum. For example the fitness function is easily obtained 
starting with the unbinned likelihood known as the Cash statis- 
tic ( [Cash 1979] : a thorough discussion is in |Tompkins 1999| ). If 
M(t, 6) is a model of the time dependence of a signal the unbinned 
log-likelihood is 

1 ogL(d) = J2 logM(t n , 6) - [ M(t , 0)dt , (17) 

71 J 


where the sum is over the events and 9 represents the model pa- 
rameters. The integral is over the observation interval and is the 
expected number of events under the model. Our block model is 
constant with a single parameter, M(t, A) = A, so for block k 

logL^(X) = N^logX - X T (fc) , (18) 

where N ( - k > is the number of events in block k and T ^ is the length 
of the block. The maximum of this likelihood is at A = /T^ k \ 

yielding 


log LW, + N {k) = N( k \ log N™ - log T {k) ) 


(19) 


The term N ^ is taken to the left side because its sum over the 
blocks is a constant (IV, the total number of events) that is model- 
independent and therefore irrelevant. Moreover note that changing 
the units of time, say by a scale factor a, changes the log-likelihood 
by —N^ 1 og{a) 1 irrelevant for the same reason. This felicitous 
property holds for other maximum likelihood fitness functions and 
removes what would otherwise be a parameter of the optimization. 
This effective scale invariance and the simplicity of eq. (fl9l) make its 
block sum the fitness function of choice to find the optimum block 
representation of event data. A possible exception is the case where 
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detection of more than one event at a given time is not possible, 
e.g. due to detector, deadtime, in which case the fitness function in 
Appendix C, < fC.2l may be more appropriate. 

It is now obvious what information a cell must contain to allow 
evaluation of the sufficient statistics N ^ and T ^ by summing two 
quantities over the cells in a block. First it must contain the number 
of events in the cell. (This is typically one, but can be more depend- 
ing on how duplicate time tags are handled; see the code section in 
Appendix A, IjA] dealing with duplicate time-tags, or ones that are 
so close that it makes sense to treat them as identical). Second, it 
must contain the interval 


At n (tn+ 1 t n ~ l)/2 , 


( 20 ) 


representing the contribution of cell n to the length of the block. 
This interval contains all times closer to event n than to any other. 

It is defined by the midpoints between successive events, and gener- 
alizes to data spaces of any dimension, where it is called the Voronoi 
tessellation of the data points, [Okabe, B oots, Sugi hara a nd Chiu 2000 
Scargle 2001a[ |Scargle 2001c]). Because 1/A t n can be regarded as 
an estimate of the local event rate at time t n , it is natural to visu- 
alize the corresponding data cell as the unit-area rectangle of width 
A t n and height 1/A t n . These ideas lead to the comment in 1 1 1.81 that 
the event-by-event adjustment for exposure can be implemented by 
shrinking A t n by the exposure factor e n . 

It is interesting to note that the actual locations of the (indepen- 
dent) events within their block do not matter. The fitness function 
depends on only the number of events in the block, not their loca- 
tions or the intervals between them. This result flows directly from 
the nature of the underlying independently distributed, or Poisson, 
process (see Appendix B, l lBl) . 

We conclude this section with evaluation of the calibration of 
ncp_prior from simulations of signal-free observational noise as de- 
scribed in 1 12.71 The results of extensive simulations for a range of 
values of N and the adopted false positive rate p 0 introduced in Eq. 
(ITT]) were found to be well fit with the formula 


ncp_prior = 4 — 73.53poN 


-.478 


( 21 ) 


For example, with po = .01 and N = 1,000 this formula gives 
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ncp_prior = 3.97. 


3.2 Binned Event Data 


The expected count in a bin is the product A eW of the true event 
rate A at the detector, a dimensionless exposure factor e ( HI. 80 . and 
the width of the bin W . Therefore the likelihood for bin n is given 
by the Poisson distribution 


Ln 


(A e n W n ) N "e 
NJ 


N„ a —Xe„W„ 


( 22 ) 


where N n is the number of events in bin n, A is the actual event 
rate in counts per unit time, e n is the exposure averaged over the 
bin, and W n is the bin width in time units. Defining bin efficiency 
as w n = e n W n , the likelihood for block k is the product of the 
likelihoods of all its bins: 

mW 

L^ = n L n = X NW e ~ XwW . (23) 

72—1 

Here M ^ is the number of bins in block k, 


mW 

72=1 


(24) 


is the sum of the bin efficiencies in the block, and 

if( k > 

N {k) =J2 N n (25) 

72=1 


is the total event count in the block. The factor (e n W n ) Nn / N n \ has 
been discarded because its product over all the bins in all the blocks 
is a constant (depending on the data only) and therefore irrelevant 
to model fitness. The log-likelihood is 

logL (fc) = N ^ k) logA - A w w , (26) 

identical to eq. ([IS]) with w ^ playing the role of T^ k \ a natural 
association since it is an effective block duration. Moreover in ret- 
rospect it is understandable that unbinned and binned event data 
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have the same fitness function, especially in view of the analysis in 
1 1C. II where ticks are allowed to contain more than one event and 
are thus equivalent to bins. In addition the way variable exposure 
is treated here could just as well have been applied to event data 
in the previous section. Note that in all of the above the bins are 
not assumed to be equal or contiguous - there can be arbitrary gaps 
between them f § 1 . 71) . 



Figure 5: Simulation study, based on the false positive rate of 0.05, to determine 
ncp_prior = -logfy) for binned data. Contours of this parameter are shown as a 
function of the number of bins and number of data points (logarithmic x- and 
y- axes, respectively). The heavy dashed line indicates the undesirable region 
where the numbers of bins and data points are equal. 

We now turn to the determination of ncp_prior for binned data. 
Figure [5] is a contour plot of the values of this parameter based on 
a simulation study with bins containing independently distributed 
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events. These contours are very insensitive to the false positive rate, 
which was taken as .05 in this figure. 


3.3 Point Measurements 

A common experimental scenario is to measure a signal s(t) at a se- 
quence of times t n , n — 1, 2, . . . , N in order to characterize its time 
dependence. Inevitable corruption due to observational errors is fre- 
quently countered by smoothing the data and/or fitting a model. As 
with the other data modes Bayesian Blocks is a different approach 
to this issue, making use of knowledge of the observational error dis- 
tribution and avoiding the information loss entailed by smoothing. 
In our treatment the set of observation times t n , collectively known 
as the sampling, can be anything - evenly spaced points or other- 
wise. Furthermore we explicitly assume that the measurements at 
these times are independent of each other, which is to say the errors 
of observation are statistically independent. 

Typically these errors are random and additive, so that the ob- 
served time series can be modeled as 


x n = x(t n ) = s(t n ) + z n n — 1,2, ... N . (27) 


The observational error z n , at time t n , is known only through its 
statistical distribution. Consider the case where the errors are taken 
to obey a normal probability distribution with zero mean and given 
variance: 

P(z n )dz n = -= e~^ )2 dz n . (28) 

O’nv 27T 

Using eqs. (12 7j) and (12%)) if the model signal is the constant s = A 
the likelihood of measurement n is 


L 


n 



1 / x n — A \2 
c 2 V Vn ’ 


(29) 


Since we assume independence of the measurements the block k 
likelihood is 


L“> = n r 

n 


(27 r) 2 1 / %-A \2 

1 g 2 Z-^/n ' a n 1 

ri m 


(30) 


Both the products and sum are over those values of the index such 
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that t lies in block k. The quantities multiplying the exponentials 
in both the above equations are irrelevant because they contribute 
an overall constant factor to the total likelihood. 

We now derive the maximum likelihood fitness function for this 
data mode (with other forms based on different priors relegated to 
Appendix C, § flC.41 IU31 IC.6l and lC.7p . The quantities 


_ 1 ^ 1 

ak 2 ^ a 2 

n ^ n 

(31) 

h = ~ 2^ — 

n 

(32) 

1 X ^ 

* = 5^ 

" n u n 

(33) 

appear in all versions of these fitness functions; the first two are 
sufficient statistics. 

As usual we need to remove the dependence of eq. (130T) on the 
parameter A, and here we accomplish this result by finding the value 
of A which maximizes the block likelihood, that is by maximizing 

Wd” V 

^ n 

(34) 

This is easily found to be 


n °n n' ° ri 

(35) 


(36) 


As expected this maximum likelihood amplitude is just the weighted 
mean value of the observations x n within the block, because defining 
the weights 

j_ 

w n = T , (37) 

n' 

yields 

^ max ^ ^ • ( 38 ) 

n 

Inserting Eq. into the log of Eq. (15U]I with the irrelevant 

factors omitted yields the corresponding maximum value of the log- 
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likelihood itself: 


log 


1 

2 


E 


X n + 


2d k \2 


(J ri 


(39) 


where again the sums are over the data in block k. Expanding the 
square 


logiSL 



(40) 


dropping the first term (quadratic in x ) which also sums to a model- 
independent constant, and using equations (1TD and (TT21) we arrive 
at 


logl/mL = b\jAa k 


(41) 


As expected each data cell must contain x n and a n but we now 
see that these quantities enter the fitness function through the sum- 
mands in the equations and (f3?l) defining a k and b k (c k does not 
matter), namely l/(2a^) and —x n /a 2 . The way the corresponding 
block summations are implemented is described in Appendix A fjAl 
( c.f . data mode #3). 

A few additional notes may be helpful. In the familiar case in 
which the error variance is assumed to be time-independent cr can be 
carried as an overall constant and a n does not have to be specified 
in each data cell. The t n are only relevant in determining which cells 
belong in a block and do not enter the fitness computation explicitly. 
And the fitness function in Eq. (1TTD is manifestly invariant to a scale 
change in the measured quantity, as is the alternative form derived 
in Appendix C, Eq. (j94|) . That is to say under the transformation 


x n — y dx n) a n — y acr 1 


(42) 


corresponding for example to a simple change in the units of x and 
cr, the fitness does not change. 

Figure [HI exhibits a simulation study to calibrate ncp_prior for 
normally distributed point measurements. For illustration the pure 
noise data simulated was normally distributed with a mean of 10 and 
unit variance. The left-hand panel shows how the false positive rate 
is diminished as ncp_prior is increased, for the 8 values of N listed 
in the caption. The horizontal line is at the adopted false positive 
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rate of 0.05; the points at which these curves cross below this line 
generate the curve shown in the bottom panel. The linear fit in the 
latter depicts the relation ncp_prior = 1.32 + 0.577 log 10 (iV). This 
relation is insensitive to the signal-to-noise ratio in the simulations. 
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Figure 6: Simulations of point measurements (Gaussian noise with signal-to- 
noise ratio of 10) to determine ncp_prigg = -log( 7 ). Top: false positive fraction 
Po vs. value of ncp_prior with separate curves for the values N = 8, 16, 32, 64, 
128, 256, 512 and 1024 (left to right; alternating dots, + and circles). The points 
at which the rate becomes unacceptable (here .05; dashed line) determines the 
recommended values of ncp_prior shown as a function of N in the bottom panel. 





4 Examples 


The following subsections present illustrative examples with sample 
data sets, demonstrating block representation for TTE data, mul- 
tivariate time series, triggering, the empty block problem for TTE 
data, and data on the circle. 

4.1 BATSE Gamma Ray Burst TTE Data 

Trigger 551 in the BATSE catalog (4B catalog name 910718) was 
chosen to exemplify analysis of time-tagged event data as it has 
moderate pulse structure. See 1 12.71 for a description of the data 
source. Figure [7] shows analysis of all of the event data in the top 
panels, and separated into the four energy channels in the lower 
panels. On the left are optimal block representations and the right 
shows the corresponding data in 32 evenly spaced bins. 

In all five cases the optimal block representations based on the 
block fitness function for event data in eq. (TT9|) are depicted for 
two cases, using values the values of ncp_prior: (1) from eq. (l2Tf 
with p 0 = 0.05 (solid lines); and (2) found with the iterative scheme 
described in 42.71 (lightly shaded blocks bounded by dashed lines). 
These two results are identical for all cases except channel 3, where 
the iterative scheme’s more conservative control of false positives 
yields fewer blocks (9 instead of 13). 

Note that the ordinary histograms of the photon times in the 
right-hand panels leave considerable uncertainty as to what the sig- 
nificant and true structures are. In the optimal block representations 
two salient conclusions are clear: (1) there are three pulses, and (2) 
they are most clearly delineated at higher energies. 
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Figure 7: BATSE TTE data for Trigger 0551. Top panels: all photons. 
Other panels: photons in the four BATSE energy channels. Left column 
shows Bayesian Block representations: default ncp_prior = solid lines; iterated 
ncp_prior = shaded/dashed lines. Right column: ordinary evenly spaced binned 
histograms. 



This figure depicts the error analysis procedures described above 
in §2U 


x 10 




Figure 8: Error analysis for the data in Channel 4 from Fig. [3 zooming in on 
the time interval with most of the activity. Top: Heavy solid line is bootstrap 
mean (256 realizations), with thin lines giving the ±ler RMS deviations, all 
superimposed on the BB representation. Bottom: approximate posterior distri- 
bution functions for the locations of the change-points, obtained by fixing all of 
the others. 
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4.2 Multivariate Time Series 

This example in Fig. [9] demonstrates the multvariate capability 
of Bayesian Blocks by analyzing data consisting of three different 
modes sampled randomly from a synthetic signal. Time-tagged 
events, binned data, and normally distributed measurements were 
independently drawn from the same signal and analyzed separately, 
yielding the block representations depicted with thin lines. 

The joint analysis of the data combined using the multivariate 
feature described above in 1 12.91 is represented as the thick dashed 
line. None of these analyses is perfect, of course, due to the sta- 
tistical fluctuations in the data. The combined analysis finds a few 
spurious change-points, but overall these do not represent serious 
distortions of the true signal. The individual analyses are some- 
what poorer at capturing the true change-points and only the true 
change-points. Hence in this example the combined analysis makes 
effective use of disparate data modes from the same signal. 

4.3 Real Time Analysis: Triggers 

Because of its incremental structure our algorithm is well suited for 
real-time analysis. Starting with a small amount of data the algo- 
rithm typically finds no change-points at first. Then by determining 
the optimal partition up to and including the most recently added 
data cell the algorithm effectively tests for the presence of the first 
change-point. The real time mode can be selected simply by trig- 
gering on the condition last(R) > 1 inserted into the code shown 
in Appendix A , fjAj just before the end of the basic iterative loop 
on R. For the entry of 1 in each element of array last means that 
the optimal partition consists of the whole array encountered so far. 
It is thus obvious that this first indication of change-point cannot 
yield more than one change-point. 

Thus the algorithm can be set to return at the first significant 
change-point. Other more complicated halting or return conditions 
can be programmed into the algorithm, such as returning after a 
specified number of change-points have been found, or when the 
location of a change-point has not moved for a specified length of 
time, etc. Essentially any condition on the change-points or the 
corresponding blocks can be imposed as a halt-and-return condition. 

The real time mode is mainly of use to detect the first sign of 
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a time-dependent signal rising significantly above a slowly varying 
background. For example, in a photon stream the resulting trigger 
may indicate the presence of a new bursting or transient source. 

The conventional way to approach problems of this sort is to re- 
port a detection if and when the actual event rate, averaged over 
some interval, exceeds one or more pre-set thresholds. See [Band 2002] 
for an extensive discussion, as well as |Fenimore et al. 20011 IMcLean et al. 2 003. 
ISehmidt 1999] for other applications in high energy astrophysics. 

One must consider a wide range of configurations: “BAT uses about 
800 different criteria to detect GRBs, each defined by a large num- 
ber of commandable parameters.” [McLean et al. 2003] . Both the 
size and locations of the intervals over which the signal is averaged 
affect the result, and therefore one must consider many different 
values of the corresponding parameters. The idea is to minimize 
the chances of missing a signal because, for example, its duration 
is poorly matched to the interval size chosen. If the background is 
determined dynamically, by averaging over an interval in which it 
is presumed there is no signal, similar considerations apply to this 
interval. 

Our segmentation algorithm greatly simplifies the above consid- 
erations, since predefined bin sizes and locations are not needed, 
and the background is automatically determined in real time. In 
practice there can be a slight complication for a continuously ac- 
cumulating data stream, since the N 2 dependence of the compute 
time may eventually make the computations unfeasible. A simple 
countermeasure is to analyze the data in a sliding window of mod- 
erate size - large enough to capture the desired changes but not so 
large that the computations take too long. Slow variations in the 
background in many cases could mandate something like a sliding 
window anyway. 

Because of additional complexities, such as accounting for back- 
ground variability and the Pandora’s box that spectral resolution 
opens [Band 2002] . we will defer a serious treatment of triggers to a 
future publication. 

We end with a few comments on the false alarm (also called false 
positive ) rate in the context of triggers. The considerations are very 
similar to the tradeoff discussed in the context of the choice for 
the parameter ncp_prior described in §1.91 1 12.71 and §3] for the 
various data modes. Even if no signal is present a sufficiently large 
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(and therefore rare) noise fluctuation can trigger any algorithm’s 
detection criteria. Unavoidably all detection procedures embody a 
trade-off between sensitivity and rate of false alarms. Other things 
being equal, making an algorithm more able to trigger on weak 
signals renders it more sensitive to noise fluctuations. Conversely 
making an algorithm shun noise fluctuations renders it insensitive to 
weak signals. In practice one chooses a balance of these competing 
factors based on the relative importance of avoiding false positives 
and not missing weak signals. Hence there can be no universal 
prescription. 

4.4 Empty Blocks 

Recall that blocks are taken to begin and end with data cells ( H2.5|h 
This convention means that no block can be empty: each much 
contain at least its initiating data cell. Hence in the case of event 
data, blocks cannot represent intervals of zero event rate. This con- 
straint is of no consequence for the other two data modes. There 
is nothing special about zero (or even negative) signals in the case 
of point measurements. Zero signal would be indicated by inter- 
vals containing only measured values not significantly different from 
zero. There is also no issue for binned data as nothing prevents 
a block from consisting of one or more empty data bins. In many 
event data applications zero signal may never occur (e.g. if there is 
a significant background over the entire observation interval). But 
in other cases it may be useful to represent such intervals in the 
form of a truly empty block, with corresponding zero height. 

Allowing such null blocks is easily implemented in a post-processing 
step applied to each of the change-points. The idea is to consider 
reassignment of data cells at the start or end of a block to the adjoin- 
ing block while leaving the block lengths unchanged. For a given 
change-point separating a pair of two blocks (“left” and “right”) 
there are two possibilities: (a) the datum marking the change-point 
itself, currently initiating the right block, can be moved from the 
right to the left block; (b) the datum just prior to the change-point 
itself, currently ending the left block, can be moved from the left 
block to the right block. Straightforward evaluation of the relevant 
fitness functions establishes whether one of these moves increases 
the fitness of the pair, and if so which one. (It is impossible that 
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this calculation will favor both moves (a) and (b); taken together 
they yield no net change and therefore leave fitness unchanged.) 

The suggested procedure is to carry out this comparison for each 
change-point in turn and adjust the populations of the blocks ac- 
cordingly. We have not proved that this ad hoc prescription yields 
globally optimal models with the non-emptiness constraint removed, 
but it is obvious that the prescription can only increase overall model 
fitness. It is quite simple computationally and there is no real down- 
side to using it routinely, even if the moves are almost never trig- 
gered. A code fragment to implement this procedure is given in 
Appendix A, gAl 

4.5 Blocks on the Circle 

Each of the data spaces discussed so far has been a linear interval 
with a well defined beginning and end. A circle does not have this 
property. Our algorithm cannot be applied to data defined on a 
circle]! such as directional measurements, because it starts with the 
first data point and iteratively works its way forward along the in- 
terval to the last point. Hence the first and last points are treated 
as distant, not as the pair of adjacent points that they are. Any 
choice of starting point, such as the coordinate origin 0 for angles 
on [ 0 , 27 t], disallows the possibility of a block containing data just 
before and after it (on the circle). In short, the iterative (mathe- 
matical induction-like) structure of the algorithm prevents it from 
being independent of the arbitrary choice of origin, which on a circle 
is completely arbitrary. We have been unable to find a solution to 
this problem using a direct application of dynamical programming. 

However there is a method that provides exact solutions at the 
cost of about one order of magnitude more computation time. First 
unfold the data with an arbitrary choice for the fiducial origin. The 
resulting series starts at this origin, continues with the subsequent 
data points in order, and ends at the datum just prior to the fiducial 
origin. Think of cutting a loop of string and straightening it out. 

The basic algorithm is then applied to the data series obtained 
by concatenating three copies of the unfolded data. The underly- 
ing idea is that the central copy is insulated from any effects of 

5 Of course the case where the measured value is confined to a specific subinterval of the 
circle is not a problem. 
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the discontinuity introduced by the unfolding. In extensive tests on 
simulated data this algorithm performed well. One check is whether 
or not the two sets of change-points adjacent to the two divisions 
between the copies of the data are always equivalent (modulo the 
length of the circle). These results suggest but do not prove cor- 
rectness for all data; there may be pathological cases for which it 
fails. Of course this IV 2 computation will take ~ 9 times as long as 
it would if the data were on a simple linear interval. 

Figure [10] shows simulated data representing measurements of an 
angle on the interval [0, 27 t] . In this case the procedure outlined 
above captures the central block (bottom panel) straddling the ori- 
gin that is broken into two parts if the data series is taken to start 
at zero (upper panel). Note that the two blocks just above 0 and 
below 2tc in the upper panel, are rendered as a single block in the 
central cycle in the bottom panel. Figure fill shows the same data 
shown in Figure [TOl plotted explicitly on a circle. 
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Figure 9: Multivariate analysis of synthetic signal consisting of two blocks sur- 
rounding a Gaussian shape centered on the interval [0,1] (solid line). Optimal 
blocks for three independent data series drawn randomly from the probability 
distribution corresponding to this signal are thin lines: 1024 event times (dash); 
4096 events in 32 bins (dot-daslr); and 32 random amplitudes normally dis- 
tributed with mean equal to the signal at random times uniformly distributed 
on [0,1] and constant variance (dots). The thicker dashed line is the combined 
analysis of all three. 
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Figure 10: Data on the circle: events drawn from two normal distributions, 
centered at 7 r and 0, the latter with some points wrapping around to values below 
27 t. Optimal blocks are depicted with thick horizontal bars superimposed on 
ordinary histograms. Top: block representation on the interval [0,27 t]. Bottom: 
Block representation of three concatenated copies of the same data on [0,67 t] . 
Vertical dotted lines at 27r and 47r indicate boundaries between the copies. The 
blocks in the central copy, between these lines, are not influenced by end effects 
and are the correct optimal representation of these circular data. 
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Figure 11: Optimal block representation of the same data as in FigureflOl (cf. the 
middle third of the bottom panel) plotted on the circle. The origin corresponds 
to the positive x-axis. and scale of the radius of the circle is arbitrary. 
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As a footnote, one application that might not be obvious is the 
case of gamma-ray burst light curves which are short enough that the 
background is accurately constant over the duration of the burst. If 
all of the data are rescaled to fit on a circle, then the pre- and post- 
burst background would automatically be subsumed into a single 
block (covering intervals at the beginning and end of the observation 
period). This procedure would be applicable to bursting light curves 
of any kind if and only if the background signal is constant, so that 
the event rates before and after the main burst are the same. 
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5 Conclusions and Future Work 


The Bayesian Blocks algorithm finds the optimal step function model 
of time series data by implementing the dynamical programming 
algorithm of [Jackson et al. 2005] . It is guaranteed to find the rep- 
resentation that maximizes any block-additive fitness function, in 
time of order N 2 , and replaces the greedy approximate algorithm in 
|Scargle 1998j| . Its real-time mode triggers on the first statistically 
significant rate change in a data stream. 

This paper addresses the following issues in the use of the algo- 
rithm for a variety of data modes: gaps and exposure variations, 
piecewise linear and piecewise exponential models, the prior distri- 
bution for the number of blocks, multivariate data, the empty block 
problem (for event data), data on the circle, dispersed data, and 
analysis of variance (’’error analysis”). The algorithm is shown to 

closely approach the theoretical detection limit derived in [Ar ias-Castro, D onoho and Huo 2003 

Work in progress includes extensions to generalized data spaces 
such as those of higher dimensions (cf. [Scargle 2001c|), and speed- 
ing up the algorithm. 
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A Reproducible Research: MatLab Code 


This paper implements the spirit of Reproducible Research, a 
publication protocol initiated by John Claerbout jClaerbout 1990j 
and developed by others at Stanford and elsewhere. The underlying 
idea is that the most effective way of publishing research is to include 
everything necessary to reproduce all of the results presented in the 
paper. In addition to all relevant mathematical equations and the 
reasoning justifying them, full implementation of this protocol re- 
quires that the data files and computer programs used to prepare all 
figures and tables are included. Cogent arguments for Reproducible 
Research, an overview of its development history, and honest as- 
sessment of its successes and failures, are eloquently described in 
Donoho et al. (2008) |. 

Following this discipline all of the MatLab code and data hies 
used in preparing this paper are available as auxiliary material. In- 
cluded is the hie ” read_me . txt” with details and a script ” reproduce_f igures . m" 
that erases all of the figure hies and regenerates them from scratch. 

In some cases the default parameters implement shorter simulation 
studies than those that were used for the figures in the paper, but 
one of the features of Reproducible Research is that such parameters 
and other aspects of the code can be changed and experimented at 
will. Accordingly this collection of scripts includes illustrative ex- 
emplars of the use of and algorithms and serves as a tutorial for the 
methods. 

In addition here is a commented version of the key fragment of 
the MatLab script (named f ind_blocks .m) for the basic algorithm 
described in this paper: 

% For data modes 1 and 2: 

7. nn_vec is the array of cell populations. 

7. Preliminary computation: 

block_length=tt_stop- [tt_start 0 . 5* (tt (2 : end)+tt (1 : end-1 ) ) ’ tt_stop] ; 


7 . 

7 . Start with first data cell; add one cell at each iteration 

7 . 

best = [] ; 
last = [] ; 

for R = 1 :num_points 
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% Compute fit_vec : fitness of putative last block (end at R) 
if data_mode == 3 % Measurements, normal errors 

sum_x_l = cumsum( cell_data( R:-l:l, 1 ) )’; 7„sum(x/sig~2) 
sum_x_0 = cumsum( cell_data( R:-l:l, 2 ) )’; 7„sum(l/sig~2) 
f it_vec=( (sum_x_l (R: -1 : 1) ) 2 ) ./( 4*sum_x_0(R: -1 : 1) ) ; 

else 

arg_log = block_length(l : R) - block_length(R+l) ; 
arg_log( find( arg_log <= 0 ) ) = Inf; 
nn_cum_vec = cumsum( nn_vec(R: -1 : 1) ); 
nn_cum_vec = nn_cum_vec(R: -1 : 1) ; 

fit_vec = nn_cum_vec . * ( log( nn_cum_vec ) - log( arg_log ) ); 

end 

[ best(R), last(R)] = max( [ 0 best ] + fit_vec - ncp_prior ); 

end 

7. 

7, Now find changepoints by iteratively peeling off the last block 

7. 

index = last( num_points ); 
change_points = [] ; 
while index > 1 

change_points = [ index change_points ] ; 
index = last ( index - 1 ) ; 

end 

B Mathematical Details 

Partitions of arrays of data cells are crucial to the block model- 
ing which our algorithm implements. This appendix collects a few 
mathematical facts about partitions and the nature of independent 
events. 

B.l Definition of Partitions 

A partition of a set is a collection of its subsets that add up to the 
whole with no overlap. Formally, a partition is a set of elements, or 
blocks {B k j satisfying 

I = [}B k (43) 

k 
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and 


Bj n B k = 0 (the empty set) for j ^ k. (44) 

Note that these conditions apply to the partitions of the time series 
data by sets of data cells. The data cells themselves may or may not 
partition the whole observation interval, as either the completeness 
in eq. (144)1 or the no-overlap condition in eq. (144)1 may be violated. 

B.2 Reduction of Infinite Partition Space to a Finite One 

For a continuous independent variable, such as time, the space of 
all possible partitions is infinitely large. We address this difficulty 
by introducing a construct in which T and its partitions are repre- 
sented in terms of a collection of N discrete data cells in one-to-one 
correspondence with the measurements H The blocks which make 
up the partitions are sets of data cells contiguous with respect to 
time-order of the cells. I.e. a given block consists of exactly all cells 
with observation times within some sub-interval of T. 

Now consider two sets of partitions of T : (a) all possible par- 
titions (b) all possible collections of cells into blocks. Set (a) is 
infinitely large since the block boundaries consist of arbitrary real 
numbers in T, but set (b) is a finite subset of (a). Nevertheless, 
under reasonable assumptions about the data mode, any partition 
in (a) can be obtained from some partition in (b) by deforming 
boundaries of its blocks without crossing a data point. Because the 
potential of a block to be an element of the optimum partition (see 
the discussion of block fitness in §4]) is a function of the content 
of the cells, such a transformation cannot substantially change the 
fitness of the partition. 

B.3 The Number of Possible Partitions 

How many different partitions of N cells are possible? Represent a 
partition by an ordered set of N zeros and ones, with one indicating 
that the corresponding time is a change-point, and zero that it is 
not. With two choices at each time, the number of combinations is 

^partitions = ^ • (45) 

6 The cells may form a partition of 7~, as for example with event data with no gaps (see 
but it is not necessary that they do so. 
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Except for very short time series this number is too large for an 
exhaustive search, but our algorithm nevertheless finds the optimum 
over this space in a time that scales as only N 2 . 

B.4 A Result for Subpartitions 

We here define subpartitions and prove an elementary corollary that 
is key to the algorithm. 


Definition: a subpartition of a given partition CP(/) 
is a subset of the blocks of CP(J). 


It is obvious that a subpartition is a partition of that subset of 
T consisting of those blocks. Although not a necessary condition 
for the result to be true, in all cases of interest here the blocks 
in the subpartition are contiguous, and thus form a partition of a 
subinterval of T. It follows that: 


Theorem: A subpartition “iP' of an optimal partition CP(7) 
is an optimal partition of the subset /' that it covers. 


For if there were a partition of /', different from and fitter than CP , 
then combining it with the blocks of CP not in CP would, by the block 
additivity condition, yield a partition of Tfitter than CP, contrary to 
the optimality of CP. 

We will make use of the following corollary: 


Corollary: removing the last block of an optimal partition 
leaves an optimal partition. 
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B.5 Essential Nature of the “Poisson” Process 

The term Poisson process refers to events occurring randomly in 
time and independently of each other. That is, the times of the 
events, 

t n ,n = 1,2,..., TV , (46) 

are independently drawn from a given probability distribution. Think 
of the events as darts thrown randomly at the interval. If the distri- 
bution is flat (i.e. the same all over the interval of interest) we have 
a constant rate Poisson process. In this special case a point is just 
as likely to occur anywhere in the interval as it is anywhere else; 
but this need not be so. What must be so in general - the essential 
nature of the Poisson process from a physical point of view - is the 
above-mentioned independence: each dart is not at all influenced by 
the others. Throwing darts that have feathers or magnets, although 
random, is not a Poisson process if these accoutrements cause the 
darts to repel or attract each other. 

This key property of independence determines all of the other 
features of the process. Most important are a set of remarkable 
properties of interval distributions (see e.g. [Pap ou lis 196 5] ). The 
time interval between a given point t 0 and the time t of the next 
event is exponentially distributed 

P(r)dr = Xe~ Xr dT , (47) 

where t — t — to- The remarkable aspect is that it does not matter 
how to is chosen; in particular the distribution is the same whether 
or not an event occurs at to- This fact makes the implementation of 
event-by-event exposure straightforward ( T1.8p . 

Note that we have not mentioned the Poisson distribution itself. 
The number of events in a fixed interval does obey the Poisson 
distribution, but this result is subsidiary to, and follows from, event 
independence. In this sense a better name than Poisson process is 
independent event process. 

In representing intensities of such processes, one scheme is to rep- 
resent each event as a delta-function in time. But a more convenient 
way to extract rate information incorporates the time interval^] be- 
tween photons. Specifically, for each photon consider the interval 

7 A method for analyzing event data based solely on inter-event time intervals has been 
developed in I' lPrahl 19961 ’!. 
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starting half way back to the previous photon and ending half way 
forward to the subsequent photon. This interval, namely 


tn— 1 tn+1 


(48) 


is the set of times closer to t n than to any other timcjl and has 
length equal to the average of the two intervals connected by photon 
n, namely 

A t n = f " +1 ~ f "- 1 . (49) 

Then the reciprocal 

= 7\7~ (50) 

L\i n 

is taken as an estimate of the signal amplitude corresponding to 
observation n. When the photon rate is large, the corresponding 
intervals are small, demonstrates the data cell concept, including 
the simple modifications to account for variable exposure and for 
weighting by photon energy. 

jPrahl 1996| has derived a statistic for event clustering in Poisson 
process data that tests departures from the known interval distri- 
bution by evaluating the likelihood over a restricted interval range. 
Prahl’s statistic is 



E (i 

A Ti<C* 



(51) 


where AT, is the interval between events i and i + 1, and 

C' = P £ AD (82) 

is the empirical mean interval. In other settings, the fact that this 
statistic is a global measure of departure of the distribution (used 
here only locally, over one block) may be useful in the detection of 
periodic, and other global, signals in event data. 

8 These intervals form the Voronoi tessellation of the total observation interval. See 
( [Okabe, Boots, Sugihara and Chiu 2000] ) for a full discussion of this construct, highly useful 
in spatial domains of 2, 3, or higher dimension; see also ( [Scargle 2001a[ |Scargle 2001c] ). 
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Figure 12: Voronoi cell of a photon. Three successive photon detection times 
are circles on the time axis. The vertical dotted lines underneath delineate 
the time extent (dt) of the cell and the height of the rectangle - n/dt, where 
n is the number of photons at exactly the same time (almost always 1) - is 
the local estimate of the signal amplitude. If the exposure at this time is less 
than unity, the width of the rectangle shrinks in proportion, the area of the 
rectangle is preserved, so the height increases in inverse proportion yielding a 
larger estimate of the true event rate. 
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C Other Block Fitness Functions 

This appendix describes fitness function for a variety of data modes. 


C.l Event Data: Alternate Derivation 


The Cash statistics used to derive the fitness function in Eq. (|IU]) is 
based on representation of event times as real numbers. Of course 
time is not measured with infinite precision, so it is interesting to 
note that a more realistic treatment yields the same formula. 

Typically the data systems’ finest time resolution is represented 
as an elementary quantum of time, which will be called a tick since 
it is usually set by a computer clock. Measured values are expressed 
as integer multiples of it (c/. §2.2.1 of |Scargle 1998] ). We assume 
that n m , the number of events (e.g. photons) detected in tick m 
obeys a Poisson distribution: 


Lm 


(A dt) 


n m s>—\dt 


A r ‘ 


„-A 




n„ 


(53) 


where dt is the length of the tick. The event rates A and A are counts 
per second and per tick, respectively. Time here is given in units 
such as seconds, but a representation in terms of (dimensionless) 
integer multiples of dt is sometimes more convenient. 

Due to event independence the block likelihood is the product 
of these individual factors over all ticks in the block. Assuming all 
ticks have the same length dt this is: 


l <*> = n 


(A dt) r 


m= 1 


(54) 


where M ^ is the number of ticks in block k. Note that non-events 
are included via the factor e~ Xdt for each tick with n m = 0. When 
this expression is used to compute the likelihood for the whole in- 
terval [i.c. product of the block likelihoods over all blocks of the 
model) the denominator contributes the factor 


UkUT k) nJ. n ,„nj ’ 

where on the right-hand side the product is over all the ticks in the 


60 


whole interval. For low event rates where n m never exceeds 1, this 
quantity is unity. No matter what it is a constant, fixed once and for 
all given the data; in model comparison contexts it is independent 
of model parameters and hence irrelevant. Dropping it, noting that 
nf,f=i e~ Xdt is just e ~ XM(k)dt = e -AAh fc ) Collecting together all factors 
for ticks with the same number of events eq. (T54|) simplifies to 

OO 

L W = e -XMW H(\dt) nHW W , (56) 

n= 0 

where H^ k \n) is the number of ticks in the block with n events. 
Noting that 

OO 

J2nH {k \n) = N {k) , (57) 

n = 0 

where N ^ is the total number of events in block fc, we have simply 

L™ = (A dt) NW e~ XM(k) . (58) 


In order for the model to depend on only the parameters defining 
the block edges, we need to eliminate A from eq. fl58]b One way to 
do this is to find the maximum of this likelihood as a function of A, 
which is easily seen to be at A = yielding 


L {k) 


N^dt. N (k) 
M( fc ) ' 


(59) 


E *r(k) 
k N = 

e~ N to the full model. Moving this ultimately irrelevant factor to 
the left-hand side, noting that M ^ and taking the log, we 

have for the maximum-likelihood block fitness function 


log L%l x + M k) = N {k \ \ogM k) - log M (fc) ) . (60) 

equivalent to Eq. (TT9jl . 

An alternative way to eliminate A is to marginalize it as in the 
Bayesian formalism. That is, one specifies a prior probability distri- 
bution for the parameter and integrates the likelihood in Eq. (|58l) 
times this prior. Since the current context is generic, not devoted 
to a specific application, we seek a distribution that expresses no 
particular prior knowledge for the value of A. It is well known that 
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there are several practical and philosophical issues connected with 
such so-called non-inf ormative priors. Here we adopt this simple 
flat, normalized prior: 


P(A) = { 


P (A) Ai < A < A 2 


0 otherwise ’ 

where the normalization condition yields 

P (A) = 1 = — . 

A 2 — Ai AA 

Thus eq. (158|) . with A marginalized, is the posterior probability 


(61) 


(62) 


Pfirg = !t(\dt) Nm e- XTm d\ 


Ml 

( dt \N^ r z 2 

tWKjW) Jz i z e a<, 


(63) 

(64) 


where Zi <2 — Ai^- In terms of the incomplete gamma function 


7 ( a , x ) = [ 

Jo 

we have, utilizing M k = 


^a— 1 „—z 


dz , 


(65) 


logPmarg = log^ - TV^logM^ + log[ 7 (^ W + 1, Z 2 ) - 7 (N^ + 1, Zl ) } 

( 66 ) 


The infinite range Zi = 0, z 2 = oo, gives 


logAarsloo) = + l°gr(JV ( *> + 1) - N< 1 > logM<*> , 

(67) 

This prior is unnormalized (and therefore sometimes regarded as 
improper). Technically P^' 1 approaches zero as z 2 — >■ oo, but is 
retained here in order to formally retain the scale invariance to be 
discussed at the end of this section. 


Another commonly used prior is the so-called conjugate Poisson 
distribution 

P(A) = C A“ -1 e -/3A . (68) 


As noted by [Gelm an, Carlin, Stern, and Rubin 1995 


this 


“prior 
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density is, in some sense, equivalent to a total count of a - 1 in j3 
prior observations,” a relation that might be useful in some circum- 
stances. The normalization constant C = and with this prior 
the marginalized posterior probability distribution is 

Pep — C r , (69) 

Jo 

yielding 

log Pep - log C = log r(JV<‘> + a) - (Art*) + a) log(M« + j3) . 

(70) 

Note that for a — l,/3 — 1 this prior and posterior reduce to those 
in Eqs. (28) and (29) of [ Scargle 1998 . 

Equations (HU]) . (HT7l) and (1701) are all invariant under a 

change in the units of time. The case of eq. (167|) is slightly dodgy, as 
mentioned above, but otherwise is a direct result of expressing N ^ 
and as dimensionless counts, of events and time-ticks, respec- 
tively. (Further, in the case of eq. (166|) . z± and Z 2 are dimensionless.) 
As mentioned above, the simplicity of eq. (TT9l) recommends it in gen- 
eral, but specific prior information (e.g. as represented by eq. 100]) 
may suggest use of one of the other forms. 

C.2 0-1 Event Data: Duplicate Time Tags Forbidden 

In Mode 2 duplicate time tags are not allowed, the number of events 
detected at a given tick is 0 or 1, and the corresponding tick likeli- 
hood is: 


L m = e Xdt = 1 - p n m = 0 (71) 

= 1- e~ Xdt = p n m = 1 (72) 

where A is the model event rate, in events per unit time. From the 
Poisson distribution p — 1 — e~ Xdt is the probability of an event, and 
1 — p — e~ Xdt that of no event. Note that p or A interchangeably 
specify the event rate. Since independent probabilities multiply, the 
block likelihood is the product of the tick likelihoods: 

M( fc ) 

L« = n L m = p N(t \l - p)M < ‘ 1 -»<" , (73) 

m= 1 
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where M M is the number of ticks in block k and N ^ is the number 
of events in the block. 

There are again two ways to proceed. The maximum of this 
likelihood occurs at p — and is 


r(fc) _ / ' jy(fc) , _ N ^ ' M (fc)_jy(fc) 

max 1 M( fc ) J 1 

Using the logarithm of the maximum likelihood, 


(74) 


log £<£* = JVW log(^) + (AfW - AfW)log(l - $£) 


(75) 


yields the fitness function, additive over blocks. 

As in the previous sub-section, an alternative is to marginalize 
A: 

p (k) = J L (k) P ^ dX ; ( 7 ( 3 ) 

where P( A) is the prior probability distribution for the rate param- 
eter. With the flat prior in eq. ([(TTllfl the posterior, marginalized 
over A is 

pW = P< A > Al - e-™) Ni (e-™) Mm - Nt d\ . (77) 

J Ai 


Changing variables to p — 1 — e Ac £ with dp — dt e Xdt d\, this 
integral becomes 

Srg = ^ (1 - p) Mm - Nm ~'d P , (78) 

with p 12 = 1 — e~ A i, 2 <£ and expressible in terms of the incomplete 
beta function 

B(z;a,b) = [ w a-1 (l — uf'^du (79) 

Jo 

as follows: 


logffirg - log^ = log[B(p 2 ; JVW + 1, M(* > - i\r<*>) - B(p,; JVW + 1, - JVW)] . 

(80) 

9 In [Scargle 1998] we used p as the independent variable, and chose a prior flat (constant) 
as a function of p. Here, we use a prior flat as a function of the rate parameter. 
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The case p\ = 0,p2 = 1 yields the ordinary beta function: 


logics - log-^ = log B(NW + 1, M ik) - JV<‘>) , 


( 81 ) 


differing from Eq. (21) of [Scarg kT 1998| by one in the second argu- 
ment, due to the difference between a prior flat in p and one flat in 
A. All of the equations ([75]) . (15U1) . and (1ST]) , can be used as fitness 
functions in the global optimization algorithm and, as with Mode 1, 
are invariant to a change in the units of time. 


A brief aside: one might be tempted to use intervals between 
successive events instead of the actual times, since in some sense 
they express rate information more directly. However, as we now 
prove, the likelihood based on intervals is essentially equivalent to 
that in eq. (I58]h It is a classic result [Pap oulis 1965] that intervals 
between (time-ordered) consecutive independent events (occurring 
with a probability uniform in time, with a constant rate A) are 
exponentially distributed: 

P{dt)dt = A e- Xdt U(dt)dt, (82) 

where U(x) is the unit step function: 

U(x) = 1 x > 0 
= 0 x < 0 . 

Pretend that the data consists of the inter-event intervals, and that 
one does not even know the absolute times. The likelihood of our 
constant-rate Poisson model for interval dt n > 0 is 

L n — Xe~ x dtn , (83) 


so the block likelihood is 

]y( k ) 

L = I] A e- A dt " = X N(k) e~ XM(k \ (84) 

n = 1 

the same as in eq. f[58j) . except that here is the number of 
inter-event intervals, one less than the number of events. 
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|Prahl 1996j derived a statistic for event clustering, by testing for 
significant departures from the known interval distribution, by eval- 
uating the likelihood over a restricted interval range. This statistic 
is 

= ± £ (1~), (85) 

JV A Ti<C* ° 

where A T* is the interval between events T and i + 1, N is the 
number of terms in the sum, and 

C ’ = ^T.^ T < ( 86 ) 

is the empirical mean of the relevant intervals. In some settings, the 
fact that this statistic is a global measure (as opposed to the local 
- over one block at a time - ones used here) may be useful in the 
detection of global signals, such as periodicities, in event data. 


C.3 Time-to-Spill Data 

As discussed in §2.2.3 of fScargle 199 8] , reduction of the necessary 
telemetry rate is sometimes accomplished by recording only the time 
of detection of every Sth photon, e.g. with S=64 for the BATSE 
timc-to-spill mode. This data mode has the attractive feature that 
its time resolution is greater when the source is brighter (and pos- 
sibly more active, so that more time resolution is useful). With 
slightly revised notation the likelihood in Eq. (32) of [ Scargle 19 98 
simplifies to 

L™ s = A Vii e - AM(fc) (87) 

where AW^ is the number of spill events in the block, and M ^ is 
as usual the length of the block in ticks. With N = this is 

identical to the Poisson likelihood in Eq. (l54]h and in particular the 

AW...S 

maximum likelihood is at A = ^ k) and the corresponding fitness 

function is 

log^mL.TTS - log JV = SJV$„ [ log(JV^S) - log ] (88) 


66 


just as in Eq. (TT9j) with N ^ = SN^^, and with the same property 
that the unit in which block lengths are expressed is irrelevant. 

C.4 Point measurements: Alternative Form 

An alternative form can be derived by inserting (1381) instead of (l36j) 
into the log of Eq. (130]) as in £13.31 The result is: 

logiSL = ~ (89) 

^ n 


Expanding the square gives 


log^mL = -liY,^) 2 - 2 Y,(-^)(J2 W n' X n') + {J2 W n' X n') 2 J2 1 


< 7 „ 




cr- 


(90) 


^ X ) ( 2 ) [ X ) 2( X ) ^re-£re)( X ) W n 'X n ' ) T ( X ) Wn'%n ' ) X ) 1 

(91) 

= Y, W n‘ X l - 2 {J2 W n X nY + (E®*'^) 2 ] (92) 

^ n' ®n' n n n ' 

= -^E(4-)[E W n X l ~ (J2 W nXn) 2 } (93) 

^ n' ^n' n n 


yielding 


where 


logL 


(*) 

max 


1 

2 


[En'(£)] 


cr 


X 


O X — ^ v v '^n^n) 

n n 


(94) 

(95) 


is the weighted average variance of the measured signal values in the 
block. It makes sense that the block fitness function is proportional 
to the negative of the variance: the best constant model for the 
block should have minimum variance. 
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C.5 Point measurements: Marginal Posterior, Flat Prior 

First, consider the simplest choice, the flat, unnormalizable prior 

P( A) = P* (for all values of A) , (96) 

giving equal weight to all values. The marginal posterior for block 
k is then, from Eq. (ISO]) . 


pk _ p 


(2n) 
n„ <7 


° -iy ( &zA )2 

/ e 2 t7n ’ dX 

J — oo 


(97) 


n J 


Using the definitions introduced above in eqs. (T3TT) . (l32]h and 
we have 

IV, 

,(2 7T)~ 


pK, _ p* 


Tin <*■. 


/ OO 

^—( a k \ 2 +b k \+Ci c ) 

-oo 


(98) 


n J — oo 


Using standard “completing the square,” letting 0 = y/ak ( A + ^-), 


giving 


= a fc (A + -^) 2 = a k ( A 

Zcik 

and then using 


Ah, 


a fc 4a? 


hq-yr) — OfcA' + &fcA + Cfc + 


bl 


4 a k 


r+oo 


dz 
\J Qfe 


we have 




(2jt) 


2 


ri r j 


.2 

-e' 4 “fc ' 




(99) 

(100) 

( 101 ) 


From this result, the log-posterior fitness function is 


logP^ - = log(P* 


6? 


+ (4~) _ Ck 


( 102 ) 


where 

Ni. 

A k = — — log(27r) - J2 1 °g( CT n) (103) 

and the subscript 0 refers to the fact that the marginal posterior was 
obtained with the unnormalized prior. The second and third terms 
in Eq. (11021 ) are invariant under the transformation (1421) . Further, 
since the integral of P( A) with respect to A must be dimensionless, 
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yielding a formal invariance for (1102)1 . However the prior in eq. (196)1 
is not normalizable, so that technically P* is undefined. A way to 
make practical use of this formal invariance is simply to include a 
constant P* that has the proper dimension (x _1 ). 

C.6 Point Measurements: Marginal Posterior, Normal- 
ized Flat Prior 

Marginalizing the likelihood in eq. (130)1 with the prior in eq. flbll) . 
yields for the marginal posterior for block k: 



( 104 ) 


As before 


pk _ p( A) 2 f X2 e ~(a k \ 2 +b k \+c k ) 

On J 

Now complete the square by letting z = y/ak ( A + giving 


( 105 ) 



Ck Ck 


( 106 ) 


so we have 



( 107 ) 



( 108 ) 



( 109 ) 


we have 



2 y/Chk fin 


69 


Taking the log gives the final expression 


1 ogPl -A t = + (^ - c t ) + lo^ erfo.i-erfo,) ] 

(m 

where the subscript A indicates the fact that this result is based on 
the finite-range prior in eq. (l6Tj) . Note that this htness function is 
manifestly invariant under the transformation in eq. (142]) . for the 
same reasons discussed at the end of the previous section, plus the 
invariance of Z\^- bi the limits Z\ — > — oo and z 2 —> oo, erf(z 2 ) — 
erf^) — >■ 2, and we recover eq. 01021) - but remember that in this 
limit the invariance is only formal. 


C.7 Point Measurements: Marginal Posterior, Gaussian 
Prior 


Finally, consider using the following normalized Gaussian prior for 


A: 


P(A) 


1 1 ( x ~ x 0 \2 

i=e 2y a o ; 

crov27r 


( 112 ) 


corresponding to prior knowledge that roughly speaking A most 
likely lies in the range Ao ± cr 0 , with a normal distribution. This 
prior is not to be confused with the Gaussian form for the likelihood 
in eq. (I29j) . 

Eq. (1301) . when A is marginalized with this prior, becomes 




1 r(2T 

cr 0 V 27 T n, 


l-(^) 


' -|[ a2 (^+E, 

e o 


<J n a Q 


^)+(^+E, 

0 *0 


(J r\! 


(113) 


so with 


and 



(114) 

(115) 

(116) 


and eq. (198|) is recovered, so that eq. (H02l) . with the redefined coeffi- 
cients in eqs. (II 14j) . (11151) and (11161) . gives the final htness function. 
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Any of the log fitness functions in eqs. (1941) . (I102p . or (II lip can 
be used for the point measurement data mode in this section. No 
general guidance for this depending on convenience or the kind of 
prior information for the signal parameters that makes sense. 

C.8 Data with Dispersed Measurements 

Throughout it has been presumed that two things are small com- 
pared to any relevant time scales: errors in the determination of 
times of events, and the intervals over which individual measure- 
ments are obtained as averages. These assumptions justify treat- 
ment of the corresponding data modes as points in 1 13. II and 1JT3] re- 
spectively. Below are discussions of data that are dispersed because 
of (1) random errors in event times and (2) measurements that are 
summations or averages over non-negligible intervals. Binned data, 
an example of the latter, have already been treated in H3.2I and are 
not discussed here. 

A simple ad hoc way to deal with both of these situations is 
to compute kernel functions for each data point, representing the 
window or error distribution in either of the two above contexts. 
Each such function would be centered at the corresponding mea- 
sured value, evaluated at all of the data points, and normalized to 
represent unit intensity. Each such kernel would be maximum at 
the data point at which it is centered, but distribute some weight to 
the other data cells. The sum of all of these kernels would then be 
a set of weights at each measurement, which could then be treated 
as ordinary event data but with fractional rather than unit weights. 
The ad hoc aspect of this approach lies in the way the fitness func- 
tion is extended. The following sub-sections provide more rigorous 
analysis. 

C.8.1 Uncertain Event Locations 

Timing of events is always uncertain at some level. Here we treat the 
case where the error distribution is wide enough to make the point 
approximation inappropriate. Rare for photon time series, with mi- 
crosecond timing errors, this situation is more common in other 
contexts and with other independent variables. With overlapping 
error distributions even the order of events can be uncertain. In the 
context described in < 11.41 one often wants to construct histograms 
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from measurements with errors - errors that may be different for 
each point (then called heteroscedastic errors). 

A simple modification of the fitness function described in 1 13.11 
addresses this kind of data. On the right-hand side of Eq. (HU]1 
quantifies the contribution of the individual events within block 

k. In extending the reasoning leading to this fitness function, the 

main issue concerns events with error distributions that have frac- 
tional overlap with the extent of block k - for events distributed 
entirely outside (inside) obviously contribute in no way (fully) to 
block fitness. By the law for the sum of probabilities of independent 
events, in the log-likelihood implicit in Eqs. (fT71) and (TT8|) N ^ is 
replaced by the sum of the areas under the probability distributions 
overlapping block k , namely summed over all events with 

significant contribution to block k , and is the integral of the 
overlapping part of the error distribution, a fraction between 0 and 

l. Thus we have 


1 ogL^ (A) = logX ^ p {l) - A (117) 

in place of Eq. (fTUjh with the analogous constant term on the left- 
hand side of that equation dropped. This result holds because a 
given datum falling inside and outside a block are mutually exclusive 
events. 

Implementing this relationship in the algorithm is easily accom- 
plished. For a given event and the interval assigned to it (c/. Figure 
[T2l in < 3B.5D sum the overlap fractions with that interval of all events 
- including that event itself. These quantities could be approxi- 
mated with very simple or complex quadrature schemes, depending 
on the context and the way in which the relevant distributions are 
represented. Normally the array nn_vec, as in the code fragment in 
TY1 is all l’s (or counts of events with identical time-tags there are 
any); but here replace it with these summed event weights. This 
construction automatically assigns the correct fractional weights to 
the block with 110 further alteration of the algorithm. 

C.8.2 Measurements in Extended Windows 

This section discusses the case of distributed measurements in the 
sense that the time of measurement is either uncertain or is effec- 
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tivcly an interval rather than a point. (This is different from the 
use of this term in 1 13.31 to describe the distribution of the measure- 
ment error in the dependent variable.) Measurements may refer to a 
quantity averaged over a range of values of t, not at a single time as 
hi 111 13.31 1C. 41 1C. 51 1C. 61 and 1C. 71 In the context of histograms ( fll.4p 
the measured quantity becomes the independent variable, and the 
dependent variable is an indicator marking the presence of the mea- 
surement there. In both cases the measurement can be thought of 
as distributed over an interval, not just at a point. 

In this case the data cell array would be augmented by the in- 
clusion of a window function, indicating the variation of the instru- 
mental sensitivity: 

x = {x n ,t n ,w n (t -t n )} n = 1 , 2 , . . . , N , ( 118 ) 

where w n (t) describes, for the value reported as X n , the relative 
weights assigned to times near t n . 

This is a nontrivial complication if the window functions overlap, 
but can nevertheless be handled with the same technique. 

We assume the standard piece-wise constant model of the under- 
lying signal, that is, a set of contiguous blocks: 

N b 

B(x) = Y j bU \x) ( 119 ) 

j= i 

where each block is represented as a boxcar function: 

B^ix) = { X r Cj+1 (120) 

v L 0 otherwise 

the Qj are the change-points, satisfying 

min(x n ) < Ci < C 2 < • • • Cj < O+i < ■ ■ ■ < C N b < max(x n ) (121) 

and the Bj are the heights of the blocks. 

The value of the observed quantity, y n , at x n , under this model 
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IS 


Vn = f w n (x)B(x)dx 

= fw n (x) Efil BW(x)dx 
= Efil Jw n (x)BW(x)dx 
= T ,?=1 Bj f^ +1 w n {x)dx 


(122) 


so we can write 


where 


Vn 


N b 


£ BjG^n) 

3=1 



w n (x)dx 


(123) 


(124) 


is the inner product of the rt-th weight function with the support of 
the j’-th block. The analysis in [Bretthorst 1988] shows how do deal 
with the non-orthogonality that is generally the case herein 

The averaging process in this data model induces dependence 
among the blocks. The likelihood, written as a product of likelihoods 
of the assumed independent data samples, is 


P(Data|Model) 


where 


— n^ =1 -P(j/n| Model) 

(125) 

-nl, 4— e A ( ”) 2 
Un=1 

(126) 

- n 1 A e 2 ( rTn ’ 

lln=1 

(127) 

= Qe , 

(128) 

N 1 

q = n-?— • 

n=i J2nal 

(129) 


After more algebra and adopting a new notation, symbolized by 


— ^ ~ t Vn (130) 

cri 


10 If the weighting functions are delta functions, it is easy to see that Gj (n) is non- zero if and 
only if x n lies in block j, and since the blocks do not overlap the product Gj(n)Gk(n) is zero 
for j ^ k , yielding orthogonality, Gj(n)Gk(n) = $j,k- And of course there can be some 
orthogonal blocks, for which there happens to be no “spill over”, but these are exceptions. 
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and 

— G k (n) , (131) 

we arrive at 

logP{{y n }\B) = Qe~% , (132) 

where 

N N b N N b N b N 

H=Zvl~ 2 E B iZ y»G,(n) + £E B,B k V G s (n)G t ( n) , 

n= 1 j=l n=l j= 1 fc=l n=l 

(133) 

The last two equations are equivalent to Eqs. (3.2) and (3.3) of 
|Bretthorst 1988] . so that the orthogonalization of the basis func- 
tions and the final expressions follow exactly as in that reference. 


C.9 Piecewise Linear Model: Event Data 


Here we outline the computations of a fitness function for the piece- 
wise linear model in the case of event data. This means that the 
event rate for a block is assumed to be linear, as in Eq. (JT|) . 

For convenience we take the fiducial time t to be t 2 , the time 
at the end of the block. Take t\ to be the time at the beginning, so 
M = f 2 — 1\ is the length of the block, and the signal x is A(1 — aM ) 
at the beginning of the block and A at the end, and varies linearly 
in between. 

The block likelihood for the case of event data fj is 


N k „t 2 

L(X,a) = T]log[ A(1 + a(U - t 2 )) ] - / A(1 + a(t - t 2 ))dt (134) 

i=i 


where the sum is over the Nk events in the block and the integral is 
over the time interval covered by the block. Simplifying we have 

N k 

L{ A, a ) = N k logX + ^ 1 og[ (1 + afc - t 2 )) ] - A[(l - at 2 )t + -t 2 ]g 

i= 1 z 

(135) 

Nk 

L(X, a) = N k logX + J2 l °g[ (1 + a(ti~t 2 )) } - XM k (l-^M k ) (136) 
Now let’s compute the maximum likelihood as a function of A 
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and a, starting by setting 


d L 
~d\ 


^ - M k { 1 - ^M k ) = 0 


so that at the maximum of this likelihood we have 


A 


N k 

M k ( 1 - f M k ) 


and therefore 


A(A maa; , a) N k 1 og[ 


N, 


N k 


M k ( 1 - f M k ) 


+ ^2 1 0 d[ (1 + a (U — t 2 )) } 


1=1 


dL 

da 


N k log[ 


N k 


N k 


M k { 1 - f M k ) 


+ ^2 1 og[ (1 + a(ti — t 2 )) ] — N k 


i= 1 


dL 

da 


y^ (tj ~ t 2 ) 

“t 1 + a(h> - t 2 ) 


+ 


0 



1 

N~k 

/(«) 


y^ (*i - t 2 ) 

1 + a(t* - t 2 ) (1 

1 (t» — t 2 ) 

iVfc 1 + a(ti - t 2 ) 

1 yv (b~f 2 ) 2 

" [1 + a(t* - t 2 )] 2 




|M fc ) 


= 0 


+ 


\M k 


(1 - f M k ) 

jjg 

(1 - fil4) 2 


2 ^ (tj - t 2 ) 
M 2 ^ 1 + a(ti - t 2 ) 


N k _ 

(1 - |M fc ) 

(1 - f M fc ) 
N k 


2 ^ (tj - t 2 ) 

M fc " 1 + a(ti - t 2 ) 


M k 

9 hi— *2) 

2^i=i 1+a(t ._ t2 ) 


1 - 



1 

2 




M k N k (J2 

1=1 


(U ~ t 2 ) 

1 + a(ti - t 2 ) 


-1 


(137) 

(138) 

-N„ 

(139) 

(140) 

(141) 

(142) 

(143) 

(144) 

(145) 

(146) 

(147) 

(148) 
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(149) 


2 



N k 


Nk(Z 

2—1 


(tj ~ t 2 ) 

1 + a(ti — t 2 ) 


)- 1 


C.10 Piecewise Exponential Model: Event Data 

In this case we model the signal as varying exponentially across the 
time interval contained in the block. Denoting the times beginning 
and ending the block as t\ and t- 2 , and taking the latter as the fiducial 
time in Equation (|2]1 . the signal is Xe~ aM at the beginning of the 
block and A at the end. 

Much as in < K.'.9I the block likelihood for the case of event data 
ti is the follow expression involving a sum over the N k events in the 
block and an integral over the time interval covered by the block: 


iV fc r t 2 

L( A, a) = J2 M \e a[ti ~ t2) } - / A e a{t ~ t2) dt (150) 
i = i Jtl 

1 — e~ aM 

L(A,a) = N k log\ + a^2(U-t 2 ) - A( ) (151) 

T a 

where M — t 2 — ti is the length of the block. 

Now let’s compute the maximum likelihood as a function of A 
and a: 

()L N k 1 - e~ aM 
~d\~ a 

and therefore at the maximum we have 

x aN k 
~ 1 - e~ aM 




— = ^ - t 2 ) — [iVfe(l — e~ aM )~ 1 ][(M + a~ 1 )e~ aM — a -1 ] (154) 

L max {a) = N k log( 1 _ e _ aM ) + a^(t l - t 2 ) - 1 _ e _ aM ( ) 

(155) 

Lmax(a) = N k 1 og ( - _^^ aM ) + a (U ~ *2) — N k (156) 
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dL max (a ) Ar .1 — e aM N „ ^ . 

1 ( — )Q + E & - 


<9a 


aiV fc 


( 157 ) 


where 


Q = iVfc[(l - _ a (! _ e -aM)-2 Me -aM| (1 58 ) 

<9 L max (a) N k e -aM 

-^ =— MW ‘(r^=) 


+ £ (if - tj) (159) 


To solve for the value of a that makes this derivative zero (to find 
the maximum of the likelihood) we will use Newton’s method to find 
the zeros of 

f(a) = dL ™*( a h Nk = I _ Me~ aM ( 1 - e ~ aM )~ l + 5 (160) 

da a 


where 




(161) 


is the mean of the differences between the event times and the time 
at the end of the block. The iterative equation is 


®fc + i 


and since S' is a constant we have 


/K) 

f'M 


(162) 


f(a) = — - — M[— Me~ aM (1 — e~ aM ) _1 — M e~ aM (1 — e~ aM )~ 2 e~ aM ) 

(163) 

f(a) = ~ + M 2 e-‘ M (l-e-‘ M )- 1 [l + e- aM (l-e- aM )- 1 ] (164) 


and defining 
we have 

and 


Q(a) = e~ aM (l - e~ aM )- 1 


f\o) — — 2 + M 2 Q(a)[l + Q(a) 




a k - MQ(a k ) + S 
a fc 2 + M 2 Q(a k )[ 1 + g(a^)]) 


(165) 

(166) 

(167) 
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